Abstract

Feedforward networks are a class of approximation techniques that can be used to learn to perform some 
tasks from a finite set of examples. The question of the capability of a network to generalize from a finite 
training set to unseen data is clearly of crucial importance. In this paper, we bound the generalization 
error of a class of Radial Basis Functions, for certain well defined function learning tasks, in terms of the 
number of parameters and number of examples. We show that the total generalization error is partly due 
to the insufficient representational capacity of the network (because of the finite size of the network being 
used) and partly due to insufficient information about the target function because of the finite number of 
samples. Prior research has looked at representational capacity or sample complexity in isolation. In the 
spirit of A. Barron, H. White and S. Geman we develop a framework to look at both. While the bound 
that we derive is specific for Radial Basis Functions, a number of observations deriving from it apply 
to any approximation technique. Our result also sheds light on ways to choose an appropriate network 
architecture for a particular problem and the kinds of problems which can be effectively solved with finite 
resources, i.e., with finite number of parameters and finite amounts of data. 
1 Introduction

Many problems in learning theory can be effectively 
modelled as learning an input output mapping on the 
basis of limited evidence of what this mapping might be. 
The mapping usually takes the form of some unknown 
function between two spaces and the evidence is often a 
set of labelled, noisy, examples i.e., (x, y) pairs which are 
consistent with this function. On the basis of this data 
set, the learner tries to infer the true function. 

Such a scenario of course exists in a wide range of 
scientific disciplines. For example, in speech recogni- 
tion, there might exist some functional relationship be- 
tween sounds and their phonetic identities. We are given 
(sound, phonetic identity) pairs from which we try to in- 
fer the underlying function. This example from speech 
recognition belongs to a large class of pattern classification
problems where the patterns could be visual, acoustic,
or tactile. In economics, it is sometimes of interest
to predict the future foreign currency rates on the ba- 
sis of the past time series. There might be a function 
which captures the dynamical relation between past and 
future currency rates and one typically tries to uncover 
this relation from data which has been appropriately pro- 
cessed. Similarly in medicine, one might be interested in 
predicting whether or not breast cancer will recur in a 
patient within five years after her treatment. The input 
space might involve dimensions like the age of the pa- 
tient, whether she has been through menopause, the ra- 
diation treatment previously used etc. The output space 
would be single dimensional boolean taking on values de- 
pending upon whether breast cancer recurs or not. One 
might collect data from case histories of patients and try 
to uncover the underlying function. 

The unknown target function is assumed to belong to 
some class T which using the terminology of computa- 
tional learning theory we call the concept class. Typi- 
cal examples of concept classes are classes of indicator 
functions, boolean functions, Sobolev spaces etc. The 
learner is provided with a finite data set. One can make 
many assumptions about how this data set is collected 
but a common assumption which would suffice for our 
purposes is that the data is drawn by sampling inde- 
pendently the input output space (X x Y) according 
to some unknown probability distribution. On the ba- 
sis of this data, the learner then develops a hypothesis 
(another function) about the identity of the target func- 
tion i.e., it comes up with a function chosen from some 
class, say H (the hypothesis class) which best fits the 
data and postulates this to be the target. Hypothesis 
classes could also be of different kinds. For example, 
they could be classes of boolean functions, polynomials, 
linear functions, spline functions and so on. One such 
class which is being increasingly used for learning prob- 
lems is the class of feedforward networks [53], [43], [35]. A 
typical feedforward network is a parametrized function 
of the form 

$$f(x) = \sum_{i=1}^{n} c_i H(x; w_i)$$

where $\{c_i\}_{i=1}^{n}$ and $\{w_i\}_{i=1}^{n}$ are free parameters and
$H(\cdot\,;\cdot)$ is a given, fixed function (the "activation function").
Depending on the choice of the activation function one gets different
network models, such as the most common form of "neural networks", the
Multilayer Perceptron [74, 18, 51, 43, 44, 30, 57, 56, 46], or the Radial
Basis Functions network [14, 26, 39, 40, 58, 70, 59, 67, 66, 32, 35].
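To make the parametrized form concrete, the following minimal sketch evaluates such a network in Python. The Gaussian activation, the function names and the numerical values are illustrative assumptions and are not choices made in the text above.

```python
import math

def gaussian_activation(x, w):
    """H(x; w): a Gaussian bump centred at w (one possible activation; an assumption)."""
    sq_dist = sum((xj - wj) ** 2 for xj, wj in zip(x, w))
    return math.exp(-sq_dist)

def network(x, coeffs, centers, activation=gaussian_activation):
    """Evaluate f(x) = sum_i c_i * H(x; w_i) for the given free parameters."""
    return sum(c * activation(x, w) for c, w in zip(coeffs, centers))

# Hypothetical parameters: n = 2 units in a 2-dimensional input space.
coeffs = [1.0, -0.5]
centers = [(0.0, 0.0), (1.0, 1.0)]
value_at_origin = network((0.0, 0.0), coeffs, centers)
print(value_at_origin)
```

Swapping `gaussian_activation` for a sigmoid of an inner product would give a multilayer-perceptron-style unit instead; only the fixed function H changes, while the free parameters keep the same role.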

If, as more and more data becomes available, the 
learner's hypothesis becomes closer and closer to the tar- 
get and converges to it in the limit, the target is said to 
be learnable. The error between the learner's hypothesis 
and the target function is defined to be the generalization 
error and for the target to be learnable the generaliza- 
tion error should go to zero as the data goes to infinity. 
While learnability is certainly a very desirable quality, it 
requires the fulfillment of two important criteria. 

First, there is the issue of the representational ca- 
pacity (or hypothesis complexity) of the hypothesis class. 
This must have sufficient power to represent or closely 
approximate the concept class. Otherwise for some target
function $f$, the best hypothesis h in H might be far
away from it. The error that this best hypothesis makes 
is formalized later as the approximation error. In this 
case, all the learner can hope to do is to converge to h 
in the limit of infinite data and so it will never recover 
the target. Second, we do not have infinite data but 
only some finite random sample set from which we con- 
struct a hypothesis. This hypothesis constructed from 
the finite data might be far from the best possible hy- 
pothesis, h, resulting in a further error. This additional 
error (caused by finiteness of data) is formalized later as 
the estimation error. The amount of data needed to en- 
sure a small estimation error is referred to as the sample 
complexity of the problem. The hypothesis complexity, 
the sample complexity and the generalization error are 
related. If the class H is very large or in other words 
has high complexity, then for the same estimation error, 
the sample complexity increases. If the hypothesis com- 
plexity is small, the sample complexity is also small but 
now for the same estimation error the approximation er- 
ror is high. This point has been developed in terms of 
the bias-variance trade-off by Geman et al. [31] in the
context of neural networks, and others [72, 38, 80, 75] in 
statistics in general. 

The purpose of this paper is two-fold. First, we for- 
malize the problem of learning from examples so as to 
highlight the relationship between hypothesis complex- 
ity, sample complexity and total error. Second, we ex- 
plore this relationship in the specific context of a partic- 
ular hypothesis class. This is the class of Radial Basis 
function networks which can be considered to belong to 
the broader class of feed-forward networks. Specifically, 
we are interested in asking the following questions about 
radial basis functions. 

Imagine you were interested in solving a particular 
problem (regression or pattern classification) using Ra- 
dial Basis Function networks. Then, how large must the 
network be and how many examples do you need to draw 
so that you are guaranteed with high confidence to do 
very well? Conversely, if you had a finite network and 
a finite amount of data, what are the kinds of problems
you could solve effectively?

Clearly, if one were using a network with a finite 
number of parameters, then its representational capac- 
ity would be limited and therefore even in the best case 
we would make an approximation error. Drawing upon 
results in approximation theory [55] several researchers 
[18, 41, 6, 44, 15, 3, 57, 56, 46, 76] have investigated 
the approximating power of feedforward networks show- 
ing how as the number of parameters goes to infinity, 
the network can approximate any continuous function. 
These results assume infinite data and questions of learn- 
ability from finite data are ignored. For a finite net- 
work, due to finiteness of the data, we make an error 
in estimating the parameters and consequently have an 
estimation error in addition to the approximation er- 
ror mentioned earlier. Using results from Vapnik and 
Chervonenkis [80, 81, 82, 83] and Pollard [69], work has 
also been done [42, 9] on the sample complexity of finite 
networks showing how as the data goes to infinity, the 
estimation error goes to zero i.e., the empirically opti- 
mized parameter settings converge to the optimal ones 
for that class. However, since the number of parameters 
is fixed and finite, even the optimal parameter setting
might yield a function which is far from the target. This 
issue is left unexplored by Haussler [42] in an excellent 
investigation of the sample complexity question. 

In this paper, we explore the errors due to both finite 
parameters and finite data in a common setting. In order 
for the total generalization error to go to zero, both the 
number of parameters and the number of data have to 
go to infinity, and we provide rates at which they grow 
for learnability to result. Further, as a corollary, we are 
able to provide a principled way of choosing the optimal 
number of parameters so as to minimize expected errors. 
It should be mentioned here that White [85] and Barron 
[7] have provided excellent treatments of this problem 
for different hypothesis classes. We will mention their 
work at appropriate points in this paper. 

The plan of the paper is as follows: in section 2 we 
will formalize the problem and comment on issues of a 
general nature. We then provide in section 3 a precise 
statement of a specific problem. In section 4 we present 
our main result, whose proof is postponed to appendix D 
for continuity of reading. The main result is qualified by 
several remarks in section 5. In section 6 we will discuss 
what could be the implications of our result in practice 
and finally we conclude in section 7 with a reiteration of 
our essential points. 



2 Definitions and Statement of the
Problem



In order to make a precise statement of the problem we 
first need to introduce some terminology and to define 
a number of mathematical objects. A summary of the 
most common notations and definitions used in this pa- 
per can be found in appendix A. 



2.1 Random Variables and Probability 
Distributions 

Let X and Y be two arbitrary sets. We will call x
and y the independent variable and response respectively,
where x and y range over the generic elements of X and
Y. In most cases X will be a subset of a k-dimensional
Euclidean space and Y a subset of the real line, so that
the independent variable will be a k-dimensional vector
and the response a real number. We assume that a
probability distribution P(x, y) is defined on X × Y. P
is unknown, although certain assumptions on it will be
made later in this section.

The probability distribution P(x, y) can also be written
as (footnote 1):

$$P(x, y) = P(x)\,P(y|x) \qquad (1)$$

where P(y|x) is the conditional probability of the response
y given the independent variable x, and P(x)
is the marginal probability of the independent variable
given by:

$$P(x) = \int_Y dy\, P(x, y).$$

Expected values with respect to P(x, y) or P(x) will
always be indicated by E[·]. Therefore, we will write:

$$E[g(x, y)] \equiv \int_{X \times Y} dx\, dy\, P(x, y)\, g(x, y)$$

and

$$E[h(x)] \equiv \int_X dx\, P(x)\, h(x)$$

for any arbitrary function g or h.

2.2 Learning from Examples and Estimators 

The framework described above can be used to model 
the fact that in the real world we often have to deal with 
sets of variables that are related by a probabilistic rela- 
tionship. For example, y could be the measured torque 
at a particular joint of a robot arm, and x the set of an- 
gular position, velocity and acceleration of the joints of 
the arm in a particular configuration. The relationship 
between x and y is probabilistic because there is noise 
affecting the measurement process, so that two different 
torques could be measured given the same configuration. 
In many cases we are provided with examples of this
probabilistic relationship, that is with a data set $D_l$,
obtained by sampling $l$ times the set X × Y according to
P(x, y):

$$D_l \equiv \{(x_i, y_i) \in X \times Y\}_{i=1}^{l}.$$

From eq. (1) we see that we can think of an element
$(x_i, y_i)$ of the data set $D_l$ as obtained by sampling X
according to P(x), and then sampling Y according to
P(y|x). In the robot arm example described above, it
would mean that one could move the robot arm into
a random configuration $x_1$, measure the corresponding
torque $y_1$, and iterate this process $l$ times.

Footnote 1. Note that we are assuming that the conditional
distribution exists, but this is not a very restrictive assumption.

The interesting problem is, given an instance of x that
does not appear in the data set $D_l$, to give an estimate
of what we expect y to be. For example, given a certain
configuration of the robot arm, we would like to estimate
the corresponding torque.

Formally, we define an estimator to be any function
$f : X \to Y$. Clearly, since the independent variable x
need not determine uniquely the response y, any estimator
will make a certain amount of error. However, it
is interesting to study the problem of finding the best
possible estimator, given the knowledge of the data set
$D_l$, and this problem will be defined as the problem of
learning from examples, where the examples are represented
by the data set $D_l$. Thus we have a probabilistic
relation between x and y. One can think of this as an
underlying deterministic relation corrupted with noise.
Hopefully a good estimator will be able to recover this
relation.

2.3 The Expected Risk and the Regression 
Function 

In the previous section we explained the problem of 
learning from examples and stated that this is the same 
as the problem of finding the best estimator. To make 
sense of this statement, we now need to define a measure
of how good an estimator is. Suppose we sample
X × Y according to P(x, y), obtaining the pair (x, y). A
measure (footnote 2) of the error of the estimator $f$ at the point x
is:

$$(y - f(x))^2.$$

In the example of the robot arm, $f(x)$ is our estimate of
the torque corresponding to the configuration x, and y is
the measured torque of that configuration. The average
error of the estimator $f$ is now given by the functional

$$I[f] \equiv E[(y - f(x))^2] = \int_{X \times Y} dx\, dy\, P(x, y)\,(y - f(x))^2,$$

that is usually called the expected risk of $f$ for the specific
choice of the error measure.

Given this particular measure as our yardstick to evaluate
different estimators, we are now interested in finding
the estimator that minimizes the expected risk. In
order to proceed we need to specify its domain of definition
T. Then using the expected risk as a criterion,
we could obtain the best element of T. Depending on
the properties of the unknown probability distribution
P(x, y) one could make different choices for T. We will
assume in the following that T is some space of differentiable
functions. For example, T could be a space of
functions with a certain number of bounded derivatives
(the spaces $A^m(R^d)$ defined in appendix A), or a Sobolev
space of functions with a certain number of derivatives
in $L_p$ (the spaces $H^{m,p}(R^d)$ defined in appendix A).



Footnote 2. Note that this is the familiar squared error, which when
averaged over its domain yields the mean squared error for a
particular estimator, a very common choice. However, it is
useful to remember that there could be other choices as well.



Assuming that the problem of minimizing $I[f]$ in T is
well posed, it is easy to obtain its solution. In fact, the
expected risk can be decomposed in the following way
(see appendix B):

$$I[f] = E[(f_0(x) - f(x))^2] + E[(y - f_0(x))^2] \qquad (2)$$

where $f_0(x)$ is the so-called regression function, that is
the conditional mean of the response given the independent
variable:

$$f_0(x) \equiv \int_Y dy\, y\, P(y|x). \qquad (3)$$

From eq. (2) it is clear that the regression function is
the function that minimizes the expected risk in T, and
is therefore the best possible estimator. Hence,

$$f_0(x) = \arg\min_{f \in T} I[f].$$

However, it is also clear that even the regression function
will make an error equal to $E[(y - f_0(x))^2]$, that
is the variance of the response given a certain value for
the independent variable, averaged over the values the
independent variable can take. While the first term in
eq. (2) depends on the choice of the estimator $f$, the second
term is an intrinsic limitation that comes from the
fact that the independent variable x does not determine
uniquely the response y.
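The decomposition in eq. (2) can be checked numerically. The sketch below is only an illustration under stated assumptions: the regression function, the estimator, the uniform input distribution and the Gaussian noise level are all hypothetical choices. It estimates the three expectations by Monte Carlo and compares the expected risk with the sum of the two terms.

```python
import random

def f0(x):
    """Hypothetical regression function (conditional mean of y given x)."""
    return x * x

def f(x):
    """Hypothetical estimator whose expected risk is being decomposed."""
    return 0.5 * x

def monte_carlo_terms(num_samples=50_000, noise_std=0.1, seed=0):
    """Estimate I[f], E[(f0 - f)^2] and E[(y - f0)^2] from simulated pairs."""
    rng = random.Random(seed)
    risk = distance = noise = 0.0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)             # x drawn from an assumed P(x)
        y = f0(x) + rng.gauss(0.0, noise_std)  # y drawn from an assumed P(y|x)
        risk += (y - f(x)) ** 2
        distance += (f0(x) - f(x)) ** 2
        noise += (y - f0(x)) ** 2
    return risk / num_samples, distance / num_samples, noise / num_samples

risk, distance, noise = monte_carlo_terms()
print(f"I[f] ~ {risk:.4f}, sum of the two terms ~ {distance + noise:.4f}")
```

The two printed numbers agree up to sampling error because the cross term, proportional to E[(y - f0(x)) (f0(x) - f(x))], has zero mean when f0 is the conditional mean of y.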

The problem of learning from examples can now be
reformulated as the problem of reconstructing the regression
function $f_0$, given the example set $D_l$. Thus we
have some large class of functions T to which the target
function $f_0$ belongs. We obtain noisy data of the form
(x, y) where x has the distribution P(x) and for each x,
y is a random variable with mean $f_0(x)$ and distribution
P(y|x). We note that y can be viewed as a deterministic
function of x corrupted by noise. If one assumes
the noise is additive, we can write $y = f_0(x) + \eta_x$ where
$\eta_x$ (footnote 3) is zero-mean with distribution P(y|x). We choose an
estimator on the basis of the data set and we hope that
it is close to the regression (target) function. It should
also be pointed out that this framework includes pattern
classification and in this case the regression (target)
function corresponds to the Bayes discriminant function
[36, 45, 71].

2.4 The Empirical Risk 

If the expected risk functional $I[f]$ were known, one
could compute the regression function by simply finding
its minimum in T, which would make the whole learning
problem considerably easier. What makes the problem
difficult and interesting is that in practice $I[f]$ is unknown
because P(x, y) is unknown. Our only source of
information is the data set $D_l$ which consists of $l$ independent
random samples of X × Y drawn according to
P(x, y). Using this data set, the expected risk can be
approximated by the empirical risk $I_{emp}$:

$$I_{emp}[f] \equiv \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2.$$

For each given estimator $f$, the empirical risk is a random
variable, and under fairly general assumptions (footnote 4), by the
law of large numbers [23] it converges in probability to
the expected risk as the number of data points goes to
infinity:

$$\lim_{l \to \infty} P\{|I[f] - I_{emp}[f]| > \varepsilon\} = 0 \quad \forall \varepsilon > 0. \qquad (4)$$

Therefore a common strategy consists in estimating the
regression function as the function that minimizes the
empirical risk, since it is "close" to the expected risk if
the number of data is high enough. For the error metric
we have used, this yields the least-squares error estimator.
However, eq. (4) states only that the expected risk
is "close" to the empirical risk for each given $f$, and not
for all $f$ simultaneously. Consequently the fact that the
empirical risk converges in probability to the expected
risk when the number, $l$, of data points goes to infinity
does not guarantee that the minimum of the empirical
risk will converge to the minimum of the expected risk
(the regression function). As pointed out and analyzed
in the fundamental work of Vapnik and Chervonenkis
[81, 82, 83] the notion of uniform convergence in probability
has to be introduced, and it will be discussed in
other parts of this paper.

Footnote 3. Note that the standard regression problem often assumes
$\eta_x$ is independent of x. Our case is distribution free because
we make no assumptions about the nature of $\eta_x$.
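As a small illustration of the pointwise convergence in eq. (4), the sketch below computes the empirical risk of one fixed estimator on samples of increasing size and compares it with the expected risk, which is known in closed form for this toy model. The data model (x uniform on [0, 1], y equal to x plus Gaussian noise) and the estimator are assumptions made only for the example.

```python
import random

def empirical_risk(f, data):
    """I_emp[f]: mean squared error of f over the pairs in data."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

def sample(l, rng, noise_std=0.1):
    """Draw l pairs with x ~ U[0, 1] and y = x + Gaussian noise (assumed model)."""
    return [(x, x + rng.gauss(0.0, noise_std))
            for x in (rng.random() for _ in range(l))]

def zero_estimator(x):
    """A fixed estimator chosen before seeing any data."""
    return 0.0

# For this model and estimator, I[f] = E[y^2] = E[x^2] + noise_std^2 = 1/3 + 0.01.
expected_risk = 1.0 / 3.0 + 0.1 ** 2

rng = random.Random(1)
gaps = {l: abs(empirical_risk(zero_estimator, sample(l, rng)) - expected_risk)
        for l in (10, 1_000, 100_000)}
for l, gap in gaps.items():
    print(l, gap)
```

The gap shrinks for this one fixed estimator. The sketch says nothing about the estimator that is selected after looking at the data, which is exactly why uniform convergence is needed.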

2.5 The Problem 

The argument of the previous section suggests that an 
approximate solution of the learning problem consists in 
finding the minimum of the empirical risk, that is solving 

$$\min_{f \in T} I_{emp}[f].$$

However this problem is clearly ill-posed, because, for
most choices of T, it will have an infinite number of
solutions. In fact, all the functions in T that interpolate
the data points $(x_i, y_i)$, that is with the property

$$f(x_i) = y_i, \qquad i = 1, \ldots, l$$

will give a zero value for $I_{emp}$. This problem is very
common in approximation theory and statistics and can 
be approached in several ways. A common technique 
consists in restricting the search for the minimum to a 
smaller set than T . We consider the case in which this 
smaller set is a family of parametric functions, that is a 
family of functions defined by a certain number of real 
parameters. The choice of a parametric representation 
also provides a convenient way to store and manipulate 
the hypothesis function on a computer. 
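The ill-posedness described above can be seen on a toy data set. In the sketch below (the data values and helper names are hypothetical), a Lagrange interpolant and a second function that differs from it by a polynomial vanishing at every data point both achieve zero empirical risk, yet they disagree away from the data.

```python
def lagrange_interpolant(xs, ys):
    """Return the polynomial of degree len(xs) - 1 passing through all (x_i, y_i)."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def empirical_risk(f, xs, ys):
    """Mean squared error of f on the data set."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noisy data set with l = 5 examples.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.1, 0.4, 0.2, 0.9, 0.7]

p = lagrange_interpolant(xs, ys)

def vanishing(x):
    """Product of (x - x_i): zero at every data point, nonzero elsewhere."""
    prod = 1.0
    for xi in xs:
        prod *= (x - xi)
    return prod

def q(x):
    """A different interpolant: p plus any multiple of the vanishing polynomial."""
    return p(x) + 10.0 * vanishing(x)

risk_p = empirical_risk(p, xs, ys)
risk_q = empirical_risk(q, xs, ys)
gap = abs(p(0.6) - q(0.6))
print(risk_p, risk_q, gap)
```

Since the multiplier 10.0 is arbitrary, there is a whole one-parameter family of minimizers here, which is why the search is restricted to a smaller parametric set.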

We will denote a generic subset of T whose elements
are parametrized by a number of parameters proportional
to n, by $H_n$. Moreover, we will assume that the
sets $H_n$ form a nested family, that is

$$H_1 \subset H_2 \subset \ldots \subset H_n \subset \ldots \subset H.$$

For example, $H_n$ could be the set of polynomials in one
variable of degree n - 1, Radial Basis Functions with n
centers, multilayer perceptrons with n sigmoidal hidden
units, multilayer perceptrons with n threshold units and
so on. Therefore, we choose as approximation to the
regression function the function $\hat{f}_{n,l}$ defined as (footnote 5):

$$\hat{f}_{n,l} \equiv \arg\min_{f \in H_n} I_{emp}[f]. \qquad (5)$$

Thus, for example, if $H_n$ is the class of functions which
can be represented as $f = \sum_{\alpha=1}^{n} c_\alpha H(x; w_\alpha)$ then eq.
(5) can be written as

$$\hat{f}_{n,l} = \arg\min_{c_\alpha, w_\alpha} I_{emp}[f].$$
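Eq. (5) can be illustrated with a hypothesis class for which the empirical risk minimizer has a closed form. The sketch below does not use Radial Basis Functions: as a stand-in assumption it takes $H_n$ to be the functions that are constant on n equal bins of [0, 1), for which the minimizer is the vector of bin means. The data model and all names are hypothetical.

```python
import math
import random

def fit_piecewise_constant(data, n):
    """Empirical risk minimizer over H_n: functions constant on n equal bins of [0, 1)."""
    sums, counts = [0.0] * n, [0] * n
    for x, y in data:
        b = min(int(x * n), n - 1)
        sums[b] += y
        counts[b] += 1
    # The bin mean minimizes the squared error inside each bin; empty bins get 0.
    levels = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: levels[min(int(x * n), n - 1)]

def empirical_risk(f, data):
    """Mean squared error of f on the data set."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

rng = random.Random(2)
data = [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.2))
        for x in (rng.random() for _ in range(200))]

# Doubling n refines every bin, so H_1, H_2, H_4, ... is a nested family.
risks = [empirical_risk(fit_piecewise_constant(data, n), data)
         for n in (1, 2, 4, 8, 16)]
print(risks)
```

Because the classes are nested, the minimal empirical risk cannot increase with n. This says nothing yet about the expected risk of the fitted function, which is the question taken up in the rest of this section.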

A number of observations need to be made here. First,
if the class T is small (typically in the sense of bounded
VC-dimension or bounded metric entropy [69]), then the
problem is not necessarily ill-posed and we do not have to
go through the process of using the sets $H_n$. However, as
has been mentioned already, for most interesting choices
of T (e.g. classes of functions in Sobolev spaces, continuous
functions etc.) the problem might be ill-posed.
However, this might not be the only reason for using the
classes $H_n$. It might be the case that $H_n$ is all we have
or for some reason it is something we would like to use.
For example, one might want to use a particular class of
feed-forward networks because of ease of implementation
in VLSI. Also, if we were to solve the function learning
problem on a computer as is typically done in practice,
then the functions in T have to be represented somehow.
We might consequently use $H_n$ as a representation
scheme. It should be pointed out that the sets $H_n$ and
T have to be matched with each other. For example,
we would hardly use polynomials as an approximation
scheme when the class T consists of indicator functions
or for that matter use threshold units when the class T
contains continuous functions. In particular, if we are to
recover the regression function, H must be dense in T.
One could look at this matching from both directions.
For a class T, one might be interested in an appropriate
choice of $H_n$. Conversely, for a particular choice of $H_n$,
one might ask what classes T can be effectively solved
with this scheme. Thus, if we were to use multilayer
perceptrons, this line of questioning would lead us to
identify the class of problems which can be effectively
solved by them.

Footnote 4. For example, assuming the data is independently drawn
and $I[f]$ is finite.

Footnote 5. Notice that we are implicitly assuming that the problem
of minimizing $I_{emp}[f]$ over $H_n$ has a solution, which might not
be the case. However the quantity

$$E_{n,l} \equiv \inf_{f \in H_n} I_{emp}[f]$$

is always well defined, and we can always find a function $\hat{f}_{n,l}$
for which $I_{emp}[\hat{f}_{n,l}]$ is arbitrarily close to $E_{n,l}$. It will turn
out that this is sufficient for our purposes, and therefore we
will continue, assuming that $\hat{f}_{n,l}$ is well defined by eq. (5).

Thus, we see that in principle we would like to minimize
$I[f]$ over the large class T obtaining thereby the
regression function $f_0$. What we do in practice is to minimize
the empirical risk $I_{emp}[f]$ over the smaller class $H_n$
obtaining the function $\hat{f}_{n,l}$. Assuming we have solved all
the computational problems related to the actual computation
of the estimator $\hat{f}_{n,l}$, the main problem is now:

how good is $\hat{f}_{n,l}$?

Independently of the measure of performance that we
choose when answering this question, we expect $\hat{f}_{n,l}$ to
become a better and better estimator as n and $l$ go to
infinity. In fact, when $l$ increases, our estimate of the expected
risk improves and our estimator improves. The
case of n is trickier. As n increases, we have more parameters
to model the regression function, and our estimator
should improve. However, at the same time, because we
have more parameters to estimate with the same amount
of data, our estimate of the expected risk deteriorates.
Thus we now need more data and n and $l$ have to grow
as a function of each other for convergence to occur.
At what rate and under what conditions the estimator
$\hat{f}_{n,l}$ improves depends on the properties of the regression
function, that is on T, and on the approximation scheme
we are using, that is on $H_n$.

2.6 Bounding the Generalization Error 

At this stage it might be worthwhile to review and remark
on some general features of the problem of learning
from examples. Let us remember that our goal is to minimize
the expected risk $I[f]$ over the set T. If we were to
use a finite number of parameters, then we have already
seen that the best we could possibly do is to minimize
our functional over the set $H_n$, yielding the estimator
$f_n$:

$$f_n \equiv \arg\min_{f \in H_n} I[f].$$

However, not only is the parametrization limited, but
the data is also finite, and we can only minimize the
empirical risk $I_{emp}$, obtaining as our final estimate the
function $\hat{f}_{n,l}$. Our goal is to bound the distance of
$\hat{f}_{n,l}$, that is our solution, from $f_0$, that is the "optimal"
solution. If we choose to measure the distance in the
$L^2(P)$ metric (see appendix A), the quantity that we
need to bound, that we will call generalization error, is:

$$E[(f_0 - \hat{f}_{n,l})^2] = \int_X dx\, P(x)\,(f_0(x) - \hat{f}_{n,l}(x))^2 = \|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)}.$$

There are 2 main factors that contribute to the generalization
error, and we are going to analyze them separately
for the moment.

1. A first cause of error comes from the fact that
we are trying to approximate an infinite dimensional
object, the regression function $f_0 \in T$, with
a finite number of parameters. We call this error
the approximation error, and we measure it by
the quantity $E[(f_0 - f_n)^2]$, that is the $L^2(P)$ distance
between the best function in $H_n$ and the regression
function. The approximation error can be
expressed in terms of the expected risk using the
decomposition (2) as

$$E[(f_0 - f_n)^2] = I[f_n] - I[f_0]. \qquad (6)$$

Notice that the approximation error does not depend
on the data set $D_l$, but depends only on the
approximating power of the class $H_n$. The natural
framework to study it is approximation theory, which
abounds with bounds on the approximation error for
a variety of choices of $H_n$ and T. In the following
we will always assume that it is possible to bound
the approximation error as follows:

$$E[(f_0 - f_n)^2] \leq \varepsilon(n)$$

where $\varepsilon(n)$ is a function that goes to zero as n goes
to infinity if H is dense in T. In other words,
as shown in figure (1), as the number n of parameters
gets larger the representation capacity of
$H_n$ increases, and allows a better and better approximation
of the regression function $f_0$. This issue
has been studied by a number of researchers
[18, 44, 6, 8, 30, 57, 56] in the neural networks community.

2. Another source of error comes from the fact that,
due to finite data, we minimize the empirical risk
$I_{emp}[f]$, and obtain $\hat{f}_{n,l}$, rather than minimizing
the expected risk $I[f]$, and obtaining $f_n$. As the
number of data goes to infinity we hope that $\hat{f}_{n,l}$
will converge to $f_n$, and convergence will take place
if the empirical risk converges to the expected risk
uniformly in probability [80]. The quantity

$$|I_{emp}[f] - I[f]|$$

is called estimation error, and conditions for the
estimation error to converge to zero uniformly in
probability have been investigated by Vapnik and
Chervonenkis [81, 82, 80, 83], Pollard [69], Dudley
[24], and Haussler [42]. Under a variety of different
hypotheses it is possible to prove that, with probability
$1 - \delta$, a bound of this form is valid:

$$|I_{emp}[f] - I[f]| \leq \omega(l, n, \delta) \qquad \forall f \in H_n. \qquad (7)$$

The specific form of $\omega$ depends on the setting of the
problem, but, in general, we expect $\omega(l, n, \delta)$ to be
a decreasing function of $l$. However, we also expect
it to be an increasing function of n. The reason
is that, if the number of parameters is large then
the expected risk is a very complex object, and then
more data will be needed to estimate it. Therefore,
keeping fixed the number of data and increasing the
number of parameters will result, on the average,
in a larger distance between the expected risk and
the empirical risk.

The approximation and estimation error are clearly
two components of the generalization error, and it is
interesting to notice, as shown in the next statement, that the
generalization error can be bounded by the sum of the
two:

Statement 2.1 The following inequality holds:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \leq \varepsilon(n) + 2\omega(l, n, \delta). \qquad (8)$$

Proof: using the decomposition of the expected risk (2),
the generalization error can be written as:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} = E[(f_0 - \hat{f}_{n,l})^2] = I[\hat{f}_{n,l}] - I[f_0]. \qquad (9)$$

A natural way of bounding the generalization error is as
follows:

$$E[(f_0 - \hat{f}_{n,l})^2] \leq |I[f_n] - I[f_0]| + |I[f_n] - I[\hat{f}_{n,l}]|. \qquad (10)$$

In the first term of the right hand side of the previous
inequality we recognize the approximation error (6). If
a bound of the form (7) is known for the estimation
error, it is simple to show (see appendix C) that the
second term can be bounded as

$$|I[f_n] - I[\hat{f}_{n,l}]| \leq 2\omega(l, n, \delta)$$

and statement (2.1) follows. □
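For readers who want the intermediate step, the bound on the second term can be obtained by a standard argument, sketched below. Appendix C is not reproduced in this section, so this derivation is a reconstruction under the assumption that the uniform bound (7) holds simultaneously for every function in the class, with probability at least $1 - \delta$; it is not necessarily the argument used there.

```latex
% Assumption: (7) holds for all f in H_n at once, with probability >= 1 - \delta.
\begin{align*}
I[\hat{f}_{n,l}]
  &\le I_{emp}[\hat{f}_{n,l}] + \omega(l,n,\delta)
     && \text{by (7) applied to } \hat{f}_{n,l} \\
  &\le I_{emp}[f_n] + \omega(l,n,\delta)
     && \hat{f}_{n,l} \text{ minimizes } I_{emp} \text{ over } H_n \\
  &\le I[f_n] + 2\,\omega(l,n,\delta)
     && \text{by (7) applied to } f_n .
\end{align*}
% Since f_n minimizes I over H_n and \hat{f}_{n,l} belongs to H_n:
\[
0 \le I[\hat{f}_{n,l}] - I[f_n] \le 2\,\omega(l,n,\delta),
\]
% and therefore, using (9) and the bound on the approximation error (6):
\[
I[\hat{f}_{n,l}] - I[f_0]
  = \bigl(I[f_n] - I[f_0]\bigr) + \bigl(I[\hat{f}_{n,l}] - I[f_n]\bigr)
  \le \varepsilon(n) + 2\,\omega(l,n,\delta).
\]
```

The factor 2 arises because the uniform bound is used twice, once at the empirical minimizer and once at the best function in the class.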

Thus we see that the generalization error has two components:
one, bounded by $\varepsilon(n)$, is related to the approximation
power of the class of functions $\{H_n\}$, and is studied
in the framework of approximation theory. The second,
bounded by $\omega(l, n, \delta)$, is related to the difficulty of
estimating the parameters given finite data, and is studied
in the framework of statistics. Consequently, results
from both these fields are needed in order to provide an
understanding of the problem of learning from examples.
Figure (1) also shows a picture of the problem.

2.7 A Note on Models and Model Complexity 

From the form of eq. (8) the reader will quickly realize
that there is a trade-off between n and $l$ for a certain
generalization error. For a fixed $l$, as n increases, the
approximation error $\varepsilon(n)$ decreases but the estimation
error $\omega(l, n, \delta)$ increases. Consequently, there is a certain
n which might optimally balance this trade-off. Note
that the classes $H_n$ can be looked upon as models of
increasing complexity and the search for an optimal n
amounts to a search for the right model complexity. One
typically wishes to match the model complexity with the
sample complexity (measured by how much data we have
on hand) and this problem is well studied [29, 75, 52, 73,
4, 28, 17] in statistics.
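The search for a balancing n can be sketched numerically once explicit forms for the two bounds are available. The forms used below are purely hypothetical placeholders, chosen only because one decreases in n and the other increases in n and decreases in the number of examples; they are not the bounds derived in this paper.

```python
import math

def approximation_bound(n):
    """Hypothetical eps(n): decreases to zero as n grows."""
    return 1.0 / n

def estimation_bound(l, n, delta):
    """Hypothetical omega(l, n, delta): grows with n, shrinks with l."""
    return math.sqrt((n * math.log(l) + math.log(1.0 / delta)) / l)

def total_bound(l, n, delta):
    """Right-hand side of a bound of the form eps(n) + 2 * omega(l, n, delta)."""
    return approximation_bound(n) + 2.0 * estimation_bound(l, n, delta)

def best_n(l, delta, n_max=200):
    """Number of parameters in 1..n_max that minimizes the total bound."""
    return min(range(1, n_max + 1), key=lambda n: total_bound(l, n, delta))

for l in (10**3, 10**4, 10**5, 10**6):
    n_star = best_n(l, delta=0.05)
    print(l, n_star, round(total_bound(l, n_star, 0.05), 4))
```

With these placeholder forms the balancing n grows with the amount of data, which matches the qualitative statement that model complexity should be matched to sample complexity.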

Broadly speaking, simple models would have high
approximation errors but small estimation errors while
complex models would have low approximation errors
but high estimation errors. This might be true even
when considering qualitatively different models and as
an illustrative example let us consider two kinds of models
we might use to learn regression functions in the
space of bounded continuous functions. The class of
linear models, i.e., the class of functions which can be
expressed as $f = w \cdot x + \theta$, does not have much approximating
power and consequently its approximation error is
rather high. However, its estimation error is quite low.
The class of models which can be expressed in the form
$f = \sum_{i=1}^{n} c_i \sin(w_i \cdot x + \theta_i)$ has higher approximating
power [47] resulting in low approximation errors. However
this class has an infinite VC-dimension [82] and its
estimation error therefore cannot be bounded.

Figure 1: This figure shows a picture of the problem.
The outermost circle represents the set T. Embedded in
this are the nested subsets, the $H_n$'s. $f_0$ is an arbitrary
target function in T, $f_n$ is the closest element of $H_n$ and
$\hat{f}_{n,l}$ is the element of $H_n$ which the learner hypothesizes
on the basis of data.

So far we have provided a very general characterization
of this problem, without stating what the sets $\mathcal{F}$
and $H_n$ are. As we have already mentioned, the
set $\mathcal{F}$ could be a set of bounded differentiable or integrable
functions, and $H_n$ could be polynomials of degree
$n$, spline functions with $n$ knots, multilayer perceptrons
with $n$ hidden units or any other parametric approximation
scheme with $n$ parameters. In the next section we
will consider a specific choice for these sets, and we will
provide a bound on the generalization error of the form
of eq. (8).

3 Stating the Problem for Radial Basis 
Functions 

As mentioned before, the problem of learning from examples
reduces to estimating some target function from a
set $X$ to a set $Y$. In most practical cases, such as character
recognition, motor control, and time series prediction,
the set $X$ is the $k$-dimensional Euclidean space $R^k$, and
the set $Y$ is some subset of the real line, which for our purposes
we will assume to be the interval $[-M, M]$, where
$M$ is some positive number. In fact, there is a probability
distribution $P(\mathbf{x}, y)$ defined on the space $R^k \times [-M, M]$
according to which the labelled examples are drawn
independently at random, and from which we try to estimate
the regression (target) function. It is clear that the
regression function is a real function of $k$ variables.

In this paper we focus our attention on the Radial Basis
Functions approximation scheme (also called Hyper-Basis
Functions [67]). This is the class of approximating
functions that can be written as:



$$f(\mathbf{x}) = \sum_{i=1}^{n} \beta_i G(\mathbf{x} - \mathbf{t}_i)$$



where $G$ is some given basis function and the $\beta_i$ and
the $\mathbf{t}_i$ are free parameters. We would like to understand
what classes of problems can be solved "well" by this
technique, where "well" means that both approximation
and estimation bounds need to be favorable. We will see
later that a favorable approximation bound can be obtained
if we assume that the class of functions $\mathcal{F}$ to which
the regression function belongs is defined as follows:



$$\mathcal{F} = \{ f \in L^2(R^k) \mid f = \lambda * G, \ |\lambda|_{R^k} \le M \} \qquad (11)$$



Here $\lambda$ is a signed Radon measure on the Borel sets of
$R^k$, $G$ is a gaussian function with range in $[0, V]$, the
symbol $*$ stands for the convolution operation, $|\lambda|_{R^k}$ is
the total variation$^6$ of the measure $\lambda$ and $M$ is a positive
real number. We point out that the class $\mathcal{F}$ is non-trivial
to learn in the sense that it has infinite pseudo-dimension
[69].

In order to obtain an estimation bound we need the 
approximating class to have bounded variation, and the 
following constraint will be imposed: 



$$\sum_{i=1}^{n} |\beta_i| \le M$$



We will see in the proof that this constraint does not
affect the approximation bound, and the two pieces fit
together nicely. Thus the set $H_n$ is defined now as the
set of functions belonging to $L^2$ such that

$$f(\mathbf{x}) = \sum_{i=1}^{n} \beta_i G(\mathbf{x} - \mathbf{t}_i), \quad \sum_{i=1}^{n} |\beta_i| \le M, \quad \mathbf{t}_i \in R^k \qquad (12)$$
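
As a concrete illustration, an element of the class in eq. (12) can be evaluated in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function names, the unnormalized Gaussian (so that $V = 1$) and the width `sigma` are assumptions made for the example.

```python
import numpy as np


def gaussian(sq_dist, sigma=1.0):
    """Unnormalized Gaussian G evaluated on squared distances (range (0, 1])."""
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_network(X, beta, centers, sigma=1.0):
    """Evaluate f(x) = sum_i beta_i * G(x - t_i) for each row x of X.

    X: (m, k) inputs (a single k-vector is also accepted),
    beta: (n,) coefficients, centers: (n, k) centers t_i.
    """
    X = np.atleast_2d(X)
    # Squared Euclidean distance between every input and every center.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return gaussian(sq_dist, sigma) @ beta


def in_H_n(beta, M):
    """Check the constraint sum_i |beta_i| <= M that defines H_n."""
    return float(np.abs(beta).sum()) <= M
```

With this parametrization the network has $n$ coefficients and $nk$ center coordinates, which is why the number of free parameters is proportional to $n$ rather than equal to it.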



Having defined the sets $H_n$ and $\mathcal{F}$ we remind the reader
that our goal is to recover the regression function, that is
the minimum of the expected risk over $\mathcal{F}$. What we end
up doing is to draw a set of $l$ examples and to minimize
the empirical risk $I_{emp}$ over the set $H_n$, that is to solve
the following non-convex minimization problem:

$$\hat{f}_{n,l} = \arg\min_{\beta_\alpha, \mathbf{t}_\alpha} \sum_{i=1}^{l} \Bigl( y_i - \sum_{\alpha=1}^{n} \beta_\alpha G(\mathbf{x}_i - \mathbf{t}_\alpha) \Bigr)^2 \qquad (13)$$

Notice that the assumption that the regression function

$$f_0(\mathbf{x}) = E[y|\mathbf{x}]$$

belongs to the class $\mathcal{F}$ correspondingly implies an
assumption on the probability distribution $P(y|\mathbf{x})$, viz.,
that $P$ must be such that $E[y|\mathbf{x}]$ belongs to $\mathcal{F}$. Notice
also that since we assumed that $Y$ is a closed interval,
we are implicitly assuming that $P(y|\mathbf{x})$ has compact support.

Assuming now that we have been able to solve the
minimization problem of eq. (13), the main question we
are interested in is "how far is $\hat{f}_{n,l}$ from $f_0$?". We give
an answer in the next section.

6 A signed measure $\lambda$ can be decomposed by the Hahn-Jordan
decomposition into $\lambda = \lambda^+ - \lambda^-$. Then $|\lambda| = \lambda^+ + \lambda^-$
is called the total variation of $\lambda$. See Dudley [23] for more
information.

4 Main Result

The main theorem is:

Theorem 4.1 For any $0 < \delta < 1$, for $n$ nodes, $l$ data
points, input dimensionality of $k$, and $H_n$, $\mathcal{F}$, $f_0$, $\hat{f}_{n,l}$ also
as defined in the statement of the problem above, with
probability greater than $1 - \delta$,

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le O\left(\frac{1}{n}\right) + O\left(\left[\frac{nk \ln(nl) - \ln \delta}{l}\right]^{1/2}\right)$$

Proof: The proof requires us to go through a series of
propositions and lemmas which have been relegated to
appendix (D) for continuity of ideas. □

5 Remarks

There are a number of comments we would like to make
on the formulation of our problem and the result we
have obtained. There is a vast body of literature on
approximation theory and the theory of empirical risk
minimization. In recent times, some of the results in
these areas have been applied by the computer science
and neural network community to study formal learning
models. Here we would like to make certain observations
about our result, suggest extensions and future work,
and make connections with other related work.



5.1 Observations on the Main Result 

• The theorem has a PAC [79] like setting. It tells
us that if we draw enough data points (labelled
examples) and have enough nodes in our Radial
Basis Functions network, we can drive our error
arbitrarily close to zero with arbitrarily high probability.
Note however that our result is not entirely
distribution-free. Although no assumptions
are made on the form of the underlying distribution,
we do have certain constraints on the kinds
of distributions for which this result holds. In particular,
the distribution is such that its conditional
mean $E[y|\mathbf{x}]$ (this is also the regression function
$f_0(\mathbf{x})$) must belong to the class of functions $\mathcal{F}$
defined by eq. (11). Further, the distribution $P(y|\mathbf{x})$
must have compact support$^7$.

7 This condition, which is related to the problem of large
deviations [80], could be relaxed, and will be the subject of further
investigations.



• The error bound consists of two parts, one
($O(1/n)$) coming from approximation theory, and
the other ($O(((nk \ln(nl) + \ln(1/\delta))/l)^{1/2})$) from
statistics. It is noteworthy that for a given approximation
scheme (corresponding to $\{H_n\}$), a certain
class of functions (corresponding to $\mathcal{F}$) suggests itself.
So we have gone from the class of networks
to the class of problems they can perform as opposed
to the other way around, i.e., from a class of
problems to an optimal class of networks.

• This sort of a result implies that if we have the
prior knowledge that $f_0$ belongs to class $\mathcal{F}$, then
by choosing the number of data points, $l$, and the
number of basis functions, $n$, appropriately, we can
drive the misclassification error arbitrarily close to
Bayes rate. In fact, for a fixed amount of data,
even before we have started looking at the data,
we can pick a starting architecture, i.e., the number
of nodes, $n$, for optimal performance. After
looking at the data, we might be able to do some
structural risk minimization [80] to further improve
architecture selection. For a fixed architecture, this
result sheds light on how much data is required for
a certain error performance. Moreover, it allows us
to choose the number of data points and number of
nodes simultaneously for guaranteed error performances.
Section 6 explores this question in greater
detail.
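
The remark on how much data a fixed architecture requires can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not part of the theorem: all constants hidden in the $O(\cdot)$ terms are set to 1 (as is done later in Section 6.2), and the function names `bound` and `min_samples` are hypothetical.

```python
import math


def bound(n, l, k, delta):
    """Theorem 4.1 bound with every hidden constant set to 1 (an assumption)."""
    approximation = 1.0 / n
    estimation = math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)
    return approximation + estimation


def min_samples(n, k, delta, target, l_max=10**12):
    """Smallest l <= l_max with bound(n, l, k, delta) <= target, or None.

    Uses bisection, which assumes the bound decreases with l; this holds
    for n >= 3, where ln(n * l) > 1 for every l >= 1.
    """
    if target <= 1.0 / n:
        return None  # the approximation term alone already reaches the target
    if bound(n, l_max, k, delta) > target:
        return None  # not reachable within the search range
    lo, hi = 1, l_max
    while lo < hi:
        mid = (lo + hi) // 2
        if bound(n, mid, k, delta) <= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, `min_samples(n=20, k=2, delta=0.05, target=0.1)` returns the number of examples at which the estimation term drops below the remaining error budget of $0.1 - 1/20$. A target at or below $1/n$ returns `None`, reflecting the fact that no amount of data can remove the approximation error of a fixed network.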

5.2 Extensions 

• There are certain natural extensions to this work.
We have essentially proved the consistency of the
estimated network function $\hat{f}_{n,l}$. In particular we
have shown that $\hat{f}_{n,l}$ converges to $f_0$ in probability
as $l$ and $n$ grow to infinity. It is also possible
to derive conditions for almost sure convergence.
Further, we have looked at a specific class
of networks ($\{H_n\}$) which consist of weighted sums
of Gaussian basis functions with moving centers
but fixed variance. This kind of approximation
scheme suggests a class of functions $\mathcal{F}$ which can
be approximated with guaranteed rates of convergence
as mentioned earlier. We could prove similar
theorems for other kinds of basis functions which
would have stronger approximation properties than
the class of functions considered here. The general
principle on which the proof is based can hopefully
be extended to a variety of approximation schemes.

• We have used notions of metric entropy and cover- 
ing number [69, 24] in obtaining our uniform con- 
vergence results. Haussler [42] uses the results of 
Pollard and Dudley to obtain uniform convergence 
results and our techniques closely follow his ap- 
proach. It should be noted here that Vapnik [80] 
deals with exactly the same question and uses the 
VC-dimension instead. It would be interesting to 
compute the VC-dimension of the class of networks 
and use it to obtain our results. 

• While we have obtained an upper bound on the er- 
ror in terms of the number of nodes and examples, 



it would be worthwhile to obtain lower bounds on 
the same. Such lower bounds do not seem to exist 
in the neural network literature to the best of our 
knowledge. 

• We have considered here a situation where the estimated
network, i.e., $\hat{f}_{n,l}$, is obtained by minimizing
the empirical risk over the class of functions
$H_n$. Very often, the estimated network is obtained
by minimizing a somewhat different objective function
which consists of two parts. One is the fit to
the data and the other is some complexity term
which favours less complex (according to the defined
notion of complexity) functions over more
complex ones. For example the regularization approach
[77, 68, 84] minimizes a cost function of the
form

$$H[f] = \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + \lambda \Phi[f]$$

over the class $H = \cup_{n \ge 1} H_n$. Here $\lambda$ is the so
called "regularization parameter" and $\Phi[f]$ is a
functional which measures smoothness of the functions
involved. It would be interesting to obtain
convergence conditions and rates for such schemes.
Choice of an optimal $\lambda$ is an interesting question
in regularization techniques and typically cross-validation
or other heuristic schemes are used. A
result on convergence rate potentially offers a principled
way to choose $\lambda$.

• Structural risk minimization is another method
to achieve a trade-off between network complexity
(corresponding to $n$ in our case) and fit to
data. However it does not guarantee that the architecture
selected will be the one with minimal
parametrization$^8$. In fact, it would be of some
interest to develop a sequential growing scheme.
Such a technique would at any stage perform a sequential
hypothesis test [37]. It would then decide
whether to ask for more data, add one more node
or simply stop and output the function it has as
its $\varepsilon$-good hypothesis. In such a process, one might
even incorporate active learning [2, 62] so that if the
algorithm asks for more data, then it might even
specify a region in the input domain from where it
would like to see this data. It is conceivable that
such a scheme would grow to minimal parametrization
(or closer to it at any rate) and require less
data than classical structural risk minimization.

8 Neither does regularization for that matter. The question
of minimal parametrization is related to that of order
determination of systems, a very difficult problem.

• It should be noted here that we have assumed that
the empirical risk $\sum_{i=1}^{l} (y_i - f(\mathbf{x}_i))^2$ can be minimized
over the class $H_n$ and the function $\hat{f}_{n,l}$ be
effectively computed. While this might be fine in
principle, in practice only a locally optimal solution
to the minimization problem is found (typically
using some gradient descent schemes). The
computational complexity of obtaining even an approximate
solution to the minimization problem is
an interesting one and results from computer science
[49, 12] suggest that it might in general be
NP-hard.
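
The last point above, that gradient descent only reaches a local minimum of the empirical risk, can be illustrated with a small sketch. Everything below is an assumption made for the example and is not the procedure analysed in the paper: the Gaussian width `sigma` is fixed, the risk is averaged over the $l$ examples, step sizes are halved until the risk decreases, and the constraint $\sum_i |\beta_i| \le M$ is enforced by simple rescaling rather than by an exact projection.

```python
import numpy as np


def design(X, T, sigma):
    """Phi[i, a] = G(x_i - t_a) for an unnormalized Gaussian G."""
    sq_dist = ((X[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def empirical_risk(X, y, beta, T, sigma):
    """Mean squared error of the RBF network on the data set."""
    return float(np.mean((y - design(X, T, sigma) @ beta) ** 2))


def fit_rbf(X, y, n, M, sigma=1.0, steps=300, lr=0.1, seed=0):
    """Gradient descent on coefficients beta and centers T (local minimum only)."""
    rng = np.random.default_rng(seed)
    T = X[rng.choice(len(X), size=n, replace=False)].copy()  # centers start on data
    beta = np.zeros(n)
    risk = empirical_risk(X, y, beta, T, sigma)
    for _ in range(steps):
        Phi = design(X, T, sigma)
        residual = Phi @ beta - y
        grad_beta = 2.0 * Phi.T @ residual / len(X)
        # d risk / d t_a = (2/l) sum_i r_i beta_a Phi_ia (x_i - t_a) / sigma^2
        W = residual[:, None] * Phi * beta[None, :]
        grad_T = 2.0 * (W.T @ X - W.sum(axis=0)[:, None] * T) / (len(X) * sigma ** 2)
        step = lr
        while step > 1e-12:
            new_beta = beta - step * grad_beta
            l1 = np.abs(new_beta).sum()
            if l1 > M:
                new_beta *= M / l1  # rescale back into the constraint set
            new_T = T - step * grad_T
            new_risk = empirical_risk(X, y, new_beta, new_T, sigma)
            if new_risk < risk:  # accept only steps that lower the risk
                beta, T, risk = new_beta, new_T, new_risk
                break
            step *= 0.5
        else:
            break  # no improving step found: stop at the current point
    return beta, T, risk
```

Different values of `seed` start the centers at different data points and can end at different final risks, which is the practical face of the non-convexity mentioned above.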

5.3 Connections with Other Results 

• In the neural network and computational learning 
theory communities results have been obtained per- 
taining to the issues of generalization and learn- 
ability. Some theoretical work has been done 
[10, 42, 61] in characterizing the sample complex- 
ity of finite sized networks. Of these, it is worth- 
while to mention again the work of Haussler [42] 
from which this paper derives much inspiration. 
He obtains bounds for a fixed hypothesis space i.e. 
a fixed finite network architecture. Here we deal 
with families of hypothesis spaces using richer and 
richer hypothesis spaces as more and more data 
becomes available. Later we will characterize the 
trade-off between hypothesis complexity and error 
rate. Others [27, 63] attempt to characterize the 
generalization abilities of feed-forward networks us- 
ing theoretical formalizations from statistical me- 
chanics. Yet others [13, 60, 16, 1] attempt to obtain 
empirical bounds on generalization abilities. 

• This is an attempt to obtain rate-of-convergence
bounds in the spirit of Barron's work [5], but using
a different approach. We have chosen to combine
theorems from approximation theory (which gives
us the $O(1/n)$ term in the rate) and uniform convergence
theory (which gives us the other part).
Note that at this moment, our rate of convergence
is worse than Barron's. In particular, he obtains a
rate of convergence of $O(1/n + (nk \ln(l))/l)$. Further,
he has a different set of assumptions on the
class of functions (corresponding to our $\mathcal{F}$). Finally,
the approximation scheme is a class of networks
with sigmoidal units as opposed to radial-basis
units and a different proof technique is used.
It should be mentioned here that his proof relies
on a discretization of the networks into a countable
family, while no such assumption is made here.

• It would be worthwhile to make a reference to Geman's
paper [31] which talks of the Bias-Variance
dilemma. This is another way of formulating the
trade-off between the approximation error and the
estimation error. As the number of parameters
(proportional to $n$) increases, the bias (which can
be thought of as analogous to the approximation
error) of the estimator decreases and its variance
(which can be thought of as analogous to the estimation
error) increases for a fixed size of the data
set. Finding the right bias-variance trade-off is very
similar in spirit to finding the trade-off between
network complexity and data complexity.

• Given the class of radial basis functions we are using,
a natural comparison arises with kernel regression
[50, 22] and results on the convergence of kernel
estimators. It should be pointed out that, unlike
our scheme, Gaussian-kernel regressors require
the variance of the Gaussian to go to zero as a function
of the data. Further the number of kernels is
always equal to the number of data points and the
issue of trade-off between the two is not explored
to the same degree.

• In our statement of the problem, we discussed how
pattern classification could be treated as a special
case of regression. In this case the function
$f_0$ corresponds to the Bayes a-posteriori decision
function. Researchers [71, 45, 36] in the neural
network community have observed that a network
trained on a least square error criterion and used
for pattern classification was in effect computing
the Bayes decision function. This paper provides a
rigorous proof of the conditions under which this is
the case.
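
For readers who want to see the contrast with kernel regression drawn above, here is a minimal sketch of a Gaussian-kernel (Nadaraya-Watson) regressor. It is the standard textbook estimator, shown only for comparison; the function name and the bandwidth argument `h` are assumptions of this example. Note that one kernel sits on every data point, so the number of kernels equals $l$, and the bandwidth `h` plays the role of the Gaussian width that must shrink as data accumulate.

```python
import numpy as np


def nadaraya_watson(X_query, X, y, h):
    """Gaussian-kernel regression: weighted average of y with weights K(x - x_i).

    X_query: (m, k) query points, X: (l, k) data points, y: (l,) responses,
    h: bandwidth (standard deviation of the Gaussian kernel).
    """
    sq_dist = ((X_query[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dist / (2.0 * h ** 2))
    # Assumes every query point has some data point within a few bandwidths,
    # so that the normalizing sum is not numerically zero.
    return (K @ y) / K.sum(axis=1)
```

In the Radial Basis Functions scheme of this paper, by contrast, the width is fixed and the $n$ centers are free parameters with $n$ typically much smaller than $l$.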

6 Implications of the Theorem in 
Practice: Putting In the Numbers 

We have stated our main result in a particular form. We
have provided a provable upper bound on the error (in
the $\|\cdot\|_{L^2(P)}$ metric) in terms of the number of examples
and the number of basis functions used. Further we
have provided the order of the convergence and have not
stated the constants involved. The same result could be
stated in other forms and has certain implications. It
provides us rates at which the number of basis functions
($n$) should increase as a function of the number of examples
($l$) in order to guarantee convergence (Section 6.1).
It also provides us with the trade-offs between the two
as explored in Section 6.2.

6.1 Rate of Growth of n for Guaranteed 
Convergence 

From our theorem (4.1) we see that the generalization error
converges to zero only if $n$ goes to infinity more slowly
than $l$. In fact, if $n$ grows too quickly the estimation error
$\omega(l, n, \delta)$ will diverge, because it grows with $n$.
Indeed, setting $n = l^r$, we obtain

$$\lim_{l \to +\infty} \omega(l, n, \delta) = \lim_{l \to +\infty} O\left(\left[\frac{l^r k \ln(l^{r+1}) + \ln(1/\delta)}{l}\right]^{1/2}\right)$$

The leading term inside the brackets behaves as $(r+1) k\, l^{r-1} \ln l$,
which vanishes as $l \to \infty$ only when $r < 1$.
Therefore the condition $r < 1$ should hold in order to
guarantee convergence to zero.
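
The condition $r < 1$ is easy to check numerically. The following sketch evaluates the estimation term for $n = l^r$; it assumes all constants in the $O(\cdot)$ term are 1 and uses illustrative values $k = 2$ and $\delta = 0.05$, none of which come from the paper.

```python
import math


def estimation_error(l, r, k=2, delta=0.05):
    """omega(l, n, delta) with n = l**r and all constants set to 1 (an assumption)."""
    n = l ** r
    return math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)


# Print the term for growing sample sizes under two growth rates of n.
for r in (0.5, 1.0):
    values = [estimation_error(10.0 ** e, r) for e in (4, 6, 8)]
    print(f"r = {r}: {[round(v, 3) for v in values]}")
```

With $r = 0.5$ the printed values shrink as $l$ grows, while with $r = 1$ they keep increasing, in agreement with the limit above.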

6.2 Optimal Choice of n 

In the previous section we made the point that the number
of parameters $n$ should grow more slowly than the
number of data points $l$, in order to guarantee the consistency
of the estimator $\hat{f}_{n,l}$. It is quite clear that there
is an optimal rate of growth of the number of parameters,
that, for any fixed amount of data points $l$, gives
the best possible performance with the least number of
parameters. In other words, for any fixed $l$ there is an
optimal number of parameters $n^*(l)$ that minimizes the
generalization error. That such a number should exist
is quite intuitive: for a fixed number of data, a small
number of parameters will give a low estimation error
$\omega(l, n, \delta)$, but very high approximation error $\varepsilon(n)$, and
therefore the generalization error will be high. If the
number of parameters is very high the approximation
error $\varepsilon(n)$ will be very small, but the estimation error
$\omega(l, n, \delta)$ will be high, leading to a large generalization error
again. Therefore, somewhere in between there should
be a number of parameters high enough to make the approximation
error small, but not too high, so that these
parameters can be estimated reliably, with a small estimation
error. This phenomenon is evident from figure
(2), where we plotted the generalization error as a function
of the number of parameters $n$ for various choices
of sample size $l$. Notice that for a fixed sample size, the
error passes through a minimum. Notice also that the location
of the minimum shifts to the right when the sample
size is increased.
Figure 2: Bound on the generalization error as a function
of the number of basis functions $n$ keeping the sample
size $l$ fixed. This has been plotted for a few different
choices of sample size. Notice how the generalization error
goes through a minimum for a certain value of $n$.
This would be an appropriate choice for the given (constant)
data complexity. Note also that the minimum is
broader for larger $l$, that is, an accurate choice of $n$ is
less critical when plenty of data is available.



In order to find out exactly what is the optimal rate of
growth of the network size we simply find the minimum
of the generalization error as a function of $n$ keeping
the sample size $l$ fixed. Therefore we have to solve the
equation:

$$\frac{\partial}{\partial n} E[(f_0 - \hat{f}_{n,l})^2] = 0$$

for $n$ as a function of $l$. Substituting the bound given in
theorem (4.1) in the previous equation, and setting all
the constants to 1 for simplicity, we obtain:

$$\frac{\partial}{\partial n} \left[ \frac{1}{n} + \left( \frac{nk \ln(nl) - \ln \delta}{l} \right)^{1/2} \right] = 0 .$$
Performing the derivative, the expression above can be
written as

$$\frac{1}{n^2} = \frac{1}{2} \left[ \frac{kn \ln(nl) - \ln \delta}{l} \right]^{-1/2} \frac{k}{l} \, [\ln(nl) + 1] .$$



We now make the assumption that $l$ is big enough to
let us perform the approximation $\ln(nl) + 1 \approx \ln(nl)$.
Moreover, we assume that

$$\frac{1}{\delta} \ll (nl)^{nk}$$

in such a way that the term including $\delta$ in the equation
above is negligible. After some algebra we therefore
conclude that the optimal number of parameters $n^*(l)$
satisfies, for large $l$, the equation:



$$n^*(l) = \left[ \frac{4l}{k \ln(n^*(l)\, l)} \right]^{1/3} .$$

From this equation it is clear that $n^*$ is roughly proportional
to a power of $l$, and therefore we can neglect the
factor $n^*$ in the denominator of the previous equation,
since it will only affect the result by a multiplicative constant.
Therefore we conclude that the optimal number
of parameters $n^*(l)$ for a given number of examples behaves
as

$$n^*(l) \propto \left[ \frac{l}{k \ln l} \right]^{1/3} . \qquad (14)$$
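
The scaling $n^*(l) \propto [l/(k \ln l)]^{1/3}$ can be compared against a direct numerical minimization of the bound. The sketch below is an illustration only: it sets all constants to 1 as in the derivation above, uses hypothetical values $k = 2$ and $\delta = 0.05$, and finds the minimizer by brute-force search over integer $n$.

```python
import math


def bound(n, l, k=2, delta=0.05):
    """Theorem 4.1 bound with all constants set to 1 (an assumption)."""
    return 1.0 / n + math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)


def optimal_n(l, k=2, delta=0.05, n_max=10_000):
    """Integer n in [1, n_max] that minimizes the bound for a fixed sample size l."""
    return min(range(1, n_max + 1), key=lambda n: bound(n, l, k, delta))


def asymptotic_rate(l, k=2):
    """The rate [l / (k ln l)]^(1/3), up to an unknown multiplicative constant."""
    return (l / (k * math.log(l))) ** (1.0 / 3.0)


# Compare the numerical minimizer with the asymptotic rate for several l.
for l in (10**4, 10**5, 10**6):
    n_star = optimal_n(l)
    print(l, n_star, round(n_star / asymptotic_rate(l), 2))
```

The last printed column is the ratio between the numerical minimizer and the asymptotic rate; it should stay roughly constant as $l$ grows, which is what the proportionality claims.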
In order to show that this is indeed the optimal rate of
growth we reported in figure (3) the generalization error
as a function of the number of examples $l$ for different
rates of growth of $n$, that is, setting $n = l^r$ for different
values of $r$. Notice that the exponent $r = 1/3$, which is very
similar to the optimal rate of eq. (14), performs better
than the larger and smaller exponents shown.
While a fixed sample size suggests the scheme above for
choosing an optimal network size, it is important to note
that for a certain confidence rate ($\delta$) and for a fixed error
rate ($\varepsilon$), there are various choices of $n$ and $l$ which are
satisfactory. Fig. 4 shows $n$ as a function of $l$, in other
words $(l, n)$ pairs which yield the same error rate with
the same confidence.

If data are expensive for us, we could operate in region
A of the curve. If network size is expensive we could
operate in region B of the curve. In particular the economics
of trading off network and data complexity would
yield a suitable point on this curve and thus would allow
us to choose the right combination of $n$ and $l$ to solve
our regression problem with the required accuracy and
confidence.



Of course we could also plot the error as a function of
data size $l$ for a fixed network size ($n$) and this has been
done for various choices of $n$ in Fig. 5.




Figure 3: The bound on the generalization error as a
function of the number of examples $l$ for different choices
of the rate at which network size $n$ increases with sample
size $l$. Notice that if $n = l$, then the estimator is not
guaranteed to converge, i.e., the bound on the generalization
error diverges. While this is a distribution-free
upper bound, we need distribution-free lower bounds as
well to make the stronger claim that $n = l$ will never
converge.




Figure 5: The generalization error as a function of num- 
ber of examples keeping the number of basis functions 
(n) fixed. This has been done for several choices of n. As 
the number of examples increases to infinity the general- 
ization error asymptotes to a minimum which is not the 
Bayes error rate because of finite hypothesis complexity 
(finite n). 
Figure 4: This figure shows various choices of $(l, n)$
which give the same generalization error. One axis
has been plotted on a log scale. The interesting observation
is that there are an infinite number of choices for
number of basis functions and number of data points all
of which would guarantee the same generalization error
(in terms of its worst case bound).
We see as expected that the error monotonically decreases
as a function of $l$. However it asymptotically
decreases not to the Bayes error rate but to some value
above it (the approximation error) which depends upon
the network complexity.

Finally figure (6) shows the result of theorem (4.1)
in a 3-dimensional plot. The generalization error, the
network size, and the sample size are all plotted as a
function of each other.

7 Conclusion 

For the task of learning some unknown function from
labelled examples where we have multiple hypothesis
classes of varying complexity, choosing the class of right
complexity and the appropriate hypothesis within that
class poses an interesting problem. We have provided an
analysis of the situation and the issues involved and in
particular have tried to show how the hypothesis complexity,
the sample complexity and the generalization
error are related. We proved a theorem for a special
set of hypothesis classes, the radial basis function networks,
and we bound the generalization error for certain
function learning tasks in terms of the number of parameters
and the number of examples. This is equivalent to
obtaining a bound on the rate at which the number of
parameters must grow with respect to the number of examples
for convergence to take place. Thus we use richer
and richer hypothesis spaces as more and more data become
available. We also see that there is a tradeoff between
hypothesis complexity and generalization error for
a certain fixed amount of data and our result allows us
a principled way of choosing an appropriate hypothesis
complexity (network architecture). The choice of an appropriate
model for empirical data is a problem of long-standing
interest in statistics and we provide connections
between our work and other work in the field.

Figure 6: The generalization error, the number of examples
($l$) and the number of basis functions ($n$) as a
function of each other.

Acknowledgments We are grateful to T. Poggio and
B. Caprile for useful discussions and suggestions.

A Notations

• $A$: a set of functions defined on $S$ such that, for
any $a \in A$,

$$0 \le a(\xi) \le U^2 \quad \forall \xi \in S .$$

• The restriction of $A$ to the data set: see eq.
(22).

• B: it will usually indicate the set of all possible
$l$-dimensional Boolean vectors.

• B: a generic $\varepsilon$-separated set in S.

• $C(\varepsilon, A, d_{L^1})$: the metric capacity of a set $A$ endowed
with the metric $d_{L^1(P)}$.

• $d(\cdot, \cdot)$: a metric on a generic metric space S.

• $d_{L^1}(\cdot, \cdot)$, $d_{L^1(P)}(\cdot, \cdot)$: $L^1$ metrics in vector spaces.
The definition depends on the space on which the
metric is defined ($k$-dimensional vectors, real
valued functions, vector valued functions).

1. In a vector space $R^k$ we have

$$d_{L^1}(\mathbf{x}, \mathbf{y}) = \frac{1}{k} \sum_{\mu=1}^{k} |x^\mu - y^\mu|$$

where $\mathbf{x}, \mathbf{y} \in R^k$, and $x^\mu$ and $y^\mu$ denote their $\mu$-th
components.
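2. In an infinite dimensional space $\mathcal{F}$ of real valued
functions in $k$ variables we have

$$d_{L^1(P)}(f, g) = \int_{R^k} |f(\mathbf{x}) - g(\mathbf{x})| \, dP(\mathbf{x})$$

where $f, g \in \mathcal{F}$ and $dP(\mathbf{x})$ is a probability
measure on $R^k$.
3. In an infinite dimensional space $\mathcal{F}$ of functions
in $k$ variables with values in $R^n$ we have

$$d_{L^1(P)}(\mathbf{f}, \mathbf{g}) = \frac{1}{n} \sum_{i=1}^{n} \int_{R^k} |f_i(\mathbf{x}) - g_i(\mathbf{x})| \, dP(\mathbf{x})$$

where
$\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_i(\mathbf{x}), \ldots, f_n(\mathbf{x}))$ and
$\mathbf{g}(\mathbf{x}) = (g_1(\mathbf{x}), \ldots, g_i(\mathbf{x}), \ldots, g_n(\mathbf{x}))$ are elements of $\mathcal{F}$
and $dP(\mathbf{x})$ is a probability measure on $R^k$.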

• $D_l$: it will always indicate a data set of $l$ points:

$$D_l = \{ (\mathbf{x}_i, y_i) \in X \times Y \}_{i=1}^{l} .$$

The points are drawn according to the probability
distribution $P(\mathbf{x}, y)$.

• $E[\cdot]$: it denotes the expected value with respect to
the probability distribution $P(\mathbf{x}, y)$. For example

$$I[f] = E[(y - f(\mathbf{x}))^2]$$

and

$$\|f_0 - f\|^2_{L^2(P)} = E[(f_0(\mathbf{x}) - f(\mathbf{x}))^2] .$$



• $f$: a generic estimator, that is any function from
$X$ to $Y$:

$$f : X \to Y .$$

• $f_0(\mathbf{x})$: the regression function, it is the conditional
mean of the response given the predictor:

$$f_0(\mathbf{x}) = \int dy \, y P(y|\mathbf{x}) .$$

It can also be defined as the function that minimizes
the expected risk $I[f]$ in $\mathcal{U}$, that is

$$f_0(\mathbf{x}) = \arg\inf_{f \in \mathcal{U}} I[f] .$$

Whenever the response is obtained sampling a
function $h$ in presence of zero mean noise the regression
function coincides with the sampled function $h$.

• $f_n$: it is the function that minimizes the expected
risk $I[f]$ in $H_n$:

$$f_n = \arg\inf_{f \in H_n} I[f]$$

Since

$$I[f] = \|f_0 - f\|^2_{L^2(P)} + I[f_0]$$

$f_n$ is also the best $L^2(P)$ approximation to the
regression function in $H_n$ (see figure 1).

• $\hat{f}_{n,l}$: is the function that minimizes the empirical
risk $I_{emp}[f]$ in $H_n$:

$$\hat{f}_{n,l} = \arg\inf_{f \in H_n} I_{emp}[f]$$

In the neural network language it is the output of
the network after training has occurred.

• $\mathcal{F}$: the space of functions to which the regression
function belongs, that is the space of functions we
want to approximate.

$$\mathcal{F} : X \to Y$$

where $X \subseteq R^d$ and $Y \subseteq R$. $\mathcal{F}$ could be for example
a set of differentiable functions, or some Sobolev
space $H^{m,p}(R^k)$.

• $\mathcal{G}$: it is a class of functions of $k$ variables

$$g : R^k \to [0, V]$$

defined as

$$\mathcal{G} = \{ g : g(\mathbf{x}) = G(\|\mathbf{x} - \mathbf{t}\|), \ \mathbf{t} \in R^k \}$$

where $G$ is the gaussian function.

• $G_1$: it is a $(k+2)$-dimensional vector space of functions
from $R^k$ to $R$ defined as

$$G_1 = \mathrm{span}\{ 1, x^1, x^2, \ldots, x^k, \|\mathbf{x}\|^2 \}$$

where $\mathbf{x} \in R^k$ and $x^\mu$ is the $\mu$-th component of the
vector $\mathbf{x}$.

• $G_2$: it is a set of real valued functions in $k$ variables
defined as

$$G_2 = \left\{ a e^{-f} : f \in G_1, \ a = \frac{1}{\sqrt{2\pi}\,\sigma} \right\}$$

where $\sigma$ is the standard deviation of the Gaussian
$G$.

• $H_I$: it is a class of vector valued functions

$$\mathbf{g}(\mathbf{x}) : R^k \to R^n$$

of the form

$$\mathbf{g}(\mathbf{x}) = (G(\|\mathbf{x} - \mathbf{t}_1\|), G(\|\mathbf{x} - \mathbf{t}_2\|), \ldots, G(\|\mathbf{x} - \mathbf{t}_n\|))$$

where $G$ is the gaussian function and the $\mathbf{t}_i$ are
arbitrary $k$-dimensional vectors.

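• $H_F$: it is a class of real valued functions in $n$ variables:

$$f : [0, V]^n \to R$$

of the form

$$f(\mathbf{x}) = \boldsymbol{\beta} \cdot \mathbf{x}$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_n)$ is an arbitrary $n$-dimensional
vector that satisfies the constraint

$$\sum_{i=1}^{n} |\beta_i| \le M .$$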



• $H_n$: a subset of $\mathcal{F}$, whose elements are
parametrized by a number of parameters proportional
to $n$. We will assume that the sets $H_n$ form
a nested family, that is

$$H_1 \subset H_2 \subset \ldots \subset H_n \subset \ldots .$$

For example $H_n$ could be the set of polynomials
in one variable of degree $n - 1$, Radial Basis Functions
with $n$ centers or multilayer perceptrons with
$n$ hidden units. Notice that for Radial Basis Functions
with moving centers and multilayer perceptrons
the number of parameters of an element of
$H_n$ is not $n$, but it is proportional to $n$ (respectively
$n(k + 1)$ and $n(k + 2)$, where $k$ is the number
of variables).

• $H$: it is defined as $H = \cup_{n=1}^{\infty} H_n$, and it is identified
with the approximation scheme. If $H_n$ is the
set of polynomials in one variable of degree $n - 1$,
$H$ is the set of polynomials of any degree.

• $H^{m,p}(R^k)$: the Sobolev space of functions in $k$
variables whose derivatives up to order $m$ are in
$L^p(R^k)$.

• $I[f]$: the expected risk, defined as

$$I[f] = \int_{X \times Y} d\mathbf{x} \, dy \, P(\mathbf{x}, y) (y - f(\mathbf{x}))^2$$

where $f$ is any function for which this expression
is well defined. It is a measure of how well the
function $f$ predicts the response $y$.

• $I_{emp}[f]$: the empirical risk. It is a functional on $\mathcal{U}$
defined as

$$I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(\mathbf{x}_i))^2 ,$$

where $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ is a set of data randomly drawn
from $X \times Y$ according to the probability distribution
$P(\mathbf{x}, y)$. It is an approximate measure of the
expected risk, since it converges to $I[f]$ in probability
when the number of data points $l$ tends to
infinity.

• k: it will always indicate the number of indepen- 
dent variables, and therefore the dimensionality of 
the set X. 

• $l$: it will always indicate the number of data points
drawn from $X$ according to the probability distribution
$P(x)$.

• $L^2(P)$: the set of functions whose square is integrable
with respect to the measure defined by the probability
distribution $P$. The norm in $L^2(P)$ is therefore defined by

$$\|f\|^2_{L^2(P)} \equiv \int_{R^k} dx\, P(x)\, f^2(x) .$$



• $\Lambda^m(R^k)(M_0, M_1, M_2, \ldots, M_m)$: the space of
functions in $k$ variables whose derivatives up to order
$m$ are bounded:

$$|D^{\alpha} f| \le M_{|\alpha|} , \qquad |\alpha| = 1, 2, \ldots, m ,$$

where $\alpha$ is a multi-index.



• $M$: a bound on the coefficients of the Gaussian Radial
Basis Functions technique considered in this paper, see
eq. (12).

• $\mathcal{M}(\epsilon, S, d)$: the packing number of the set $S$, with
metric $d$.

• $\mathcal{N}(\epsilon, S, d)$: the covering number of the set $S$, with
metric $d$.

• $n$: a positive number proportional to the number
of parameters of the approximating function. It will
usually be the number of basis functions for the
RBF technique or the number of hidden units for
a multilayer perceptron.

• $P(x)$: a probability distribution defined on $X$. It
is the probability distribution according to which
the data are drawn from $X$.

• $P(y|x)$: the conditional probability of the response
$y$ given the predictor $x$. It represents the probabilistic
dependence of $y$ on $x$. If there is no noise in the
system it has the form $P(y|x) = \delta(y - h(x))$,
for some function $h$, indicating that the predictor
$x$ uniquely determines the response $y$.

• $P(x, y)$: the joint distribution of the predictors and
the response. It is a probability distribution on
$X \times Y$ and has the form

$$P(x, y) = P(x) P(y|x) .$$

• $\mathcal{S}$: it will usually denote a metric space, endowed
with a metric $d$.

• $S$: a generic subset of a metric space $\mathcal{S}$.

• $T$: a generic $\epsilon$-cover of a subset $S \subseteq \mathcal{S}$.

• $U$: it gives a bound on the elements of the class $\mathcal{A}$.
In the specific case of the class $\mathcal{A}$ considered in the
proof we have $U = 1 + MV$.

• $\mathcal{U}$: the set of all the functions from $X$ to $Y$ for
which the expected risk is well defined.

• $V$: a bound on the Gaussian basis function $G$:

$$0 \le G(x) \le V , \quad \forall x \in R^k .$$

• $X$: a subset of $R^k$, not necessarily proper. It is the
set of the independent variables, or predictors, or,
in the language of neural networks, input variables.

• $x$: a generic element of $X$, and therefore a
$k$-dimensional vector (in the neural network language
it is the input vector).

• $Y$: a subset of $R$, whose elements represent the
response variable, which in the neural network language
is the output of the network. Unless otherwise stated it
will be assumed to be compact, implying that $\mathcal{F}$ is a
set of bounded functions. In pattern recognition problems
it is simply the set $\{0, 1\}$.

• $y$: a generic element of $Y$; it denotes the response
variable.

B A Useful Decomposition of the
Expected Risk

We now show that the function that minimizes the
expected risk

$$I[f] = \int_{X \times Y} P(x, y)\, dx\, dy\, (y - f(x))^2$$

is the regression function defined in eq. (3). It is
sufficient to add and subtract the regression function in the
definition of expected risk:

$$I[f] = \int_{X \times Y} dx\, dy\, P(x, y)\,(y - f_0(x) + f_0(x) - f(x))^2$$
$$= \int_{X \times Y} dx\, dy\, P(x, y)\,(y - f_0(x))^2 +$$
$$+ \int_{X \times Y} dx\, dy\, P(x, y)\,(f_0(x) - f(x))^2 +$$
$$+ 2 \int_{X \times Y} dx\, dy\, P(x, y)\,(y - f_0(x))(f_0(x) - f(x))$$

By definition of the regression function $f_0(x)$, the cross
product in the last equation is easily seen to be zero, and
therefore

$$I[f] = \int_X dx\, P(x)\,(f_0(x) - f(x))^2 + I[f_0] .$$

Since the last term of $I[f]$ does not depend on $f$, the
minimum is achieved when the first term is minimum,
that is when $f(x) = f_0(x)$.



In the case in which the data come from randomly
sampling a function $f$ in the presence of additive noise
$\epsilon$, with probability distribution $\mathcal{P}(\epsilon)$ and zero mean,
we have $P(y|x) = \mathcal{P}(y - f(x))$, so that the regression
function $f_0$ coincides with $f$, and then

$$I[f_0] = \int_{X \times Y} dx\, dy\, P(x, y)\,(y - f_0(x))^2 = \qquad (15)$$
$$= \int_X dx\, P(x) \int_Y dy\, (y - f(x))^2\, \mathcal{P}(y - f(x)) = \qquad (16)$$
$$= \int_X dx\, P(x) \int_Y \epsilon^2\, \mathcal{P}(\epsilon)\, d\epsilon = \sigma^2 \qquad (17)$$

where $\sigma^2$ is the variance of the noise. When data are
noisy, therefore, even in the most favourable case we
cannot expect the expected risk to be smaller than the
variance of the noise.
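
The decomposition above can be checked exactly on a small example. The following is a minimal sketch, assuming a hypothetical joint distribution on a finite grid (so integrals become sums) and using exact rational arithmetic; the table `P` and the candidate predictor `f` are illustrative choices, not data from the paper.

```python
from fractions import Fraction as Fr

# Hypothetical joint distribution P(x, y) on a finite grid (illustrative only).
P = {
    (0, 0): Fr(1, 10), (0, 1): Fr(2, 10),
    (1, 0): Fr(3, 10), (1, 2): Fr(1, 10),
    (2, 1): Fr(1, 10), (2, 3): Fr(2, 10),
}

def marginal_x(P):
    """P(x) = sum_y P(x, y)."""
    px = {}
    for (x, _), p in P.items():
        px[x] = px.get(x, Fr(0)) + p
    return px

def regression_function(P):
    """f_0(x) = E[y | x] = sum_y y P(x, y) / P(x)."""
    px = marginal_x(P)
    num = {}
    for (x, y), p in P.items():
        num[x] = num.get(x, Fr(0)) + y * p
    return {x: num[x] / px[x] for x in px}

def expected_risk(P, f):
    """I[f] = sum_{x,y} P(x, y) (y - f(x))^2."""
    return sum(p * (y - f[x]) ** 2 for (x, y), p in P.items())

def decomposition(P, f):
    """Right-hand side: sum_x P(x) (f_0(x) - f(x))^2 + I[f_0]."""
    px, f0 = marginal_x(P), regression_function(P)
    distance = sum(px[x] * (f0[x] - f[x]) ** 2 for x in px)
    return distance + expected_risk(P, f0)

f = {0: Fr(1), 1: Fr(0), 2: Fr(2)}   # an arbitrary candidate predictor
lhs = expected_risk(P, f)            # I[f] computed directly
rhs = decomposition(P, f)            # distance to f_0 plus I[f_0]
```

In this sketch `lhs` and `rhs` agree exactly, and `expected_risk(P, regression_function(P))` plays the role of the irreducible term $I[f_0]$.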



Figure 7: If the distance between $I[f_n]$ and $I[\hat{f}_{n,l}]$ is
larger than $2\epsilon$, the condition $I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]$ is
violated.



C A Useful Inequality 

Let us assume that, with probability $1 - \delta$, a uniform
bound has been established:

$$|I_{emp}[f] - I[f]| \le \omega(l, n, \delta) \quad \forall f \in H_n .$$

We want to prove that the following inequality also
holds:

$$|I[f_n] - I[\hat{f}_{n,l}]| \le 2\omega(l, n, \delta) . \qquad (18)$$

This fact is easily established by noting that since the
bound above is uniform, then it holds for both $f_n$ and
$\hat{f}_{n,l}$, and therefore the following inequalities hold:

$$I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega$$
$$I_{emp}[f_n] \le I[f_n] + \omega$$

Moreover, by definition, the two following inequalities
also hold:

$$I[f_n] \le I[\hat{f}_{n,l}]$$
$$I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]$$

Therefore the following chain of inequalities holds,
proving inequality (18):

$$I[f_n] \le I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega \le I_{emp}[f_n] + \omega \le I[f_n] + 2\omega$$

An intuitive explanation of these inequalities is also
given in figure (7).

D Proof of the Main Theorem

The theorem will be proved in a series of steps. For
clarity of presentation we have divided the proof into four
parts. The first takes the original problem and breaks it
into its approximation and estimation components. The
second and third parts are devoted to obtaining bounds
for these two components respectively. The fourth and
final part comes back to the original problem, reassembles
its components and proves our main result. New
definitions and notation will be introduced as and when
the necessity arises.

We have seen in section 2 (statement 2.1) that the
generalization error can be bounded, with probability
$1 - \delta$, as follows:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le \epsilon(n) + 2\omega(l, n, \delta) . \qquad (19)$$

In the next parts we will derive specific expressions for
the approximation error $\epsilon$ and for the estimation error
$\omega$ in order to prove theorem (4.1).

D.1 Bounding the approximation error

In this part we attempt to bound the approximation
error. In section 3 we assumed that the class of functions
to which the regression function belongs, that is the class
of functions that we want to approximate, is

$$\mathcal{F} = \{ f \in L^2(R^k) \mid f = \lambda * G, \ |\lambda|_{R^k} \le M \} ,$$

where $\lambda$ is a signed Radon measure on the Borel sets
of $R^k$, $G$ is a Gaussian function with range $[0, V]$, the
symbol $*$ stands for the convolution operation, $|\lambda|_{R^k}$ is
the total variation of the measure $\lambda$ and $M$ is a positive
real number. Our approximating family is the class:

$$H_n = \Big\{ f \in L^2 \ \Big|\ f = \sum_{i=1}^{n} \beta_i G(x - t_i), \ \sum_{i=1}^{n} |\beta_i| \le M, \ t_i \in R^k \Big\}$$

It has been shown in [33, 34] that the class $H_n$ uniformly
approximates elements of $\mathcal{F}$, and that the following bound
is valid:

$$E[(f_0 - f_n)^2] \le O\left(\frac{1}{n}\right) . \qquad (20)$$

This result is based on a lemma by Jones [48] on the
convergence rate of an iterative approximation scheme
in Hilbert spaces. A formally similar lemma, brought to
our attention by R. Dudley [25], is due to Maurey and
was published by Pisier [65]. Here we report a version
of the lemma due to Barron [6, 7] that contains a slight
refinement of Jones' result:

Lemma D.1 (Maurey-Jones-Barron) If $f$ is in the
closure of the convex hull of a set $\mathcal{G}$ in a Hilbert space $H$
with $\|g\| \le b$ for each $g \in \mathcal{G}$, then for every $n \ge 1$ and
for $c > b^2 - \|f\|^2$ there is a $f_n$ in the convex hull of $n$
points in $\mathcal{G}$ such that

$$\|f - f_n\|^2 \le \frac{c}{n} .$$

In order to exploit this result one needs to define suitable
classes of functions which are the closure of the convex
hull of some subset $\mathcal{G}$ of a Hilbert space $H$. One way
to approach the problem consists in utilizing the integral
representation of functions. Suppose that the functions
in a Hilbert space $H$ can be represented by the integral

$$f(x) = \int_{\mathcal{M}} G_t(x)\, d\alpha(t) \qquad (21)$$

where $d\alpha$ is some measure on the parameter set $\mathcal{M}$, and
$G_t(x)$ is a function of $H$ parametrized by the parameter
$t$, whose norm $\|G_t(x)\|$ is bounded by the same number
for any value of $t$. If $d\alpha$ is a finite measure, the integral
(21) can be seen as an infinite convex combination, and
therefore, applying lemma (D.1) one can prove that there
exist $n$ coefficients $c_i$ and $n$ parameter vectors $t_i$ such
that

$$\Big\| f - \sum_{i=1}^{n} c_i G_{t_i}(x) \Big\|^2 \le O\left(\frac{1}{n}\right) .$$

For the class $\mathcal{F}$ we consider, it is clear that functions
in this class have an integral representation of the type
(21) in which $G_t(x) = G(x - t)$, and the work in [33, 34]
shows how to apply lemma (D.1) to this class.
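
The $1/n$ rate of the lemma can be observed numerically in a finite-dimensional setting. The sketch below is an illustration under stated assumptions, not the construction of [33, 34]: the dictionary `G` is a hypothetical finite set of vectors in $R^3$, the target `f` is their uniform convex combination, and each iterate is the uniform average of $n$ dictionary elements chosen greedily. For this incremental averaging scheme one can verify by induction that the squared error is at most `c0 / n`, where `c0` is the weighted mean of $\|g\|^2$ minus $\|f\|^2$ and is therefore no larger than $b^2 - \|f\|^2$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical finite dictionary in R^3 (a stand-in for the set in the lemma).
G = [(math.cos(0.7 * j), math.sin(1.3 * j), math.cos(2.1 * j)) for j in range(40)]
weights = [1.0 / len(G)] * len(G)            # convex weights defining the target
f = tuple(sum(w * g[i] for w, g in zip(weights, G)) for i in range(3))

# c0 = sum_j w_j ||g_j||^2 - ||f||^2, which is at most b^2 - ||f||^2.
c0 = sum(w * dot(g, g) for w, g in zip(weights, G)) - dot(f, f)

def greedy_errors(G, f, n_max):
    """Return [||f - f_1||^2, ..., ||f - f_nmax||^2] for the greedy iterates."""
    errors, f_n = [], None
    for n in range(1, n_max + 1):
        best = None
        for g in G:
            if f_n is None:
                cand = g
            else:
                # f_n = (1 - 1/n) f_{n-1} + (1/n) g: a uniform average of n points.
                cand = tuple((1 - 1 / n) * a + (1 / n) * b for a, b in zip(f_n, g))
            err = sq_dist(f, cand)
            if best is None or err < best[0]:
                best = (err, cand)
        errors.append(best[0])
        f_n = best[1]
    return errors

errors = greedy_errors(G, f, 25)
```

Each entry `errors[n - 1]` can be compared with `c0 / n`; the dimension of the ambient space does not enter the bound, which is the point made in the text about dimension-independent rates.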

Notice that the bound (20), that is similar in spirit to 
the result of A. Barron on multilayer perceptrons [6, 8], 
is interesting because the rate of convergence does not 
depend on the dimension d of the input space. This is 
apparently unusual in approximation theory, because it 
is known, from the theory of linear and nonlinear widths 
[78, 64, 54, 55, 20, 19, 21, 56], that, if the function that 
has to be approximated has d variables and a degree of 
smoothness s, we should not expect to find an approxi- 
mation technique whose approximation error goes to zero 
faster than $O(n^{-s/d})$. Here "degree of smoothness" is a
measure of how constrained the class of functions we con- 
sider is, for example the number of derivatives that are 
uniformly bounded, or the number of derivatives that are 
integrable or square integrable. Therefore, from classi- 
cal approximation theory, we expect that, unless certain 
constraints are imposed on the class of functions to be 
approximated, the rate of convergence will dramatically 
slow down as the number of dimensions increases, show- 
ing the phenomenon known as "the curse of dimension- 
ality" [11]. 

In the case of the class $\mathcal{F}$ we consider here, the constraint
of considering functions that are convolutions of Radon
measures with Gaussians seems to impose on this class of
functions an amount of smoothness that is sufficient to
guarantee that the rate of convergence does not become
slower and slower as the dimension increases. A longer
discussion of the "curse of dimensionality" can be found
in [34].

We notice also that, since the rate (20) is independent
of the dimension, the class $\mathcal{F}$, together with the
approximating class $H_n$, defines a class of problems that are
"tractable" even in a high number of dimensions.

D.2 Bounding the estimation error

In this part we attempt to bound the estimation error
$|I[f] - I_{emp}[f]|$. In order to do that we first need to
introduce some basic concepts and notations.

Let $S$ be a subset of a metric space $\mathcal{S}$ with metric $d$.
We say that an $\epsilon$-cover with respect to the metric $d$ is
a set $T \subseteq \mathcal{S}$ such that for every $s \in S$, there exists some
$t \in T$ satisfying $d(s, t) \le \epsilon$. The size of the smallest
$\epsilon$-cover is $\mathcal{N}(\epsilon, S, d)$ and is called the covering number
of $S$. In other words

$$\mathcal{N}(\epsilon, S, d) = \min_{T \subseteq \mathcal{S}} |T|$$

where $T$ runs over all the possible $\epsilon$-covers of $S$ and $|T|$
denotes the cardinality of $T$.

A set $B$ belonging to the metric space $\mathcal{S}$ is said to
be $\epsilon$-separated if for all distinct $x, y \in B$, $d(x, y) > \epsilon$.
We define the packing number $\mathcal{M}(\epsilon, S, d)$ as the size of
the largest $\epsilon$-separated subset of $S$. Thus

$$\mathcal{M}(\epsilon, S, d) = \max_{B \subseteq S} |B|$$

where $B$ runs over all the $\epsilon$-separated subsets of $S$. It is
easy to show that the covering number is never larger than
the packing number, that is $\mathcal{N}(\epsilon, S, d) \le \mathcal{M}(\epsilon, S, d)$.
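
The reason the covering number cannot exceed the packing number is that an $\epsilon$-separated subset which cannot be enlarged is automatically an $\epsilon$-cover. The sketch below illustrates this on a hypothetical finite point set `S` in the plane, with a normalized $L^1$ distance; the helper names are illustrative and the greedy scan is just one simple way to build a maximal separated subset.

```python
def greedy_separated_subset(points, eps, dist):
    """Scan the points once, keeping each one that is more than eps away
    from everything kept so far. The result is a maximal eps-separated set."""
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept

def is_cover(centers, points, eps, dist):
    """True if every point lies within eps of some center."""
    return all(any(dist(p, c) <= eps for c in centers) for p in points)

def is_separated(subset, eps, dist):
    """True if all pairs of distinct elements are more than eps apart."""
    return all(dist(p, q) > eps
               for i, p in enumerate(subset) for q in subset[i + 1:])

def l1(p, q):
    """Normalized L1 distance between two points with the same number of coordinates."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

# Hypothetical finite set S of points in the unit square.
S = [(i / 10.0, ((7 * i) % 10) / 10.0) for i in range(30)]
eps = 0.15
B = greedy_separated_subset(S, eps, l1)
```

Here `B` is $\epsilon$-separated by construction, and any point that was rejected lies within `eps` of a kept point, so `B` also covers `S`. The smallest cover is therefore no larger than `len(B)`, which in turn is no larger than the packing number.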

Let now $P(\xi)$ be a probability distribution defined on
$\mathcal{S}$, and $\mathcal{A}$ be a set of real-valued functions defined on $\mathcal{S}$
such that, for any $a \in \mathcal{A}$,

$$0 \le a(\xi) \le U^2 \quad \forall \xi \in \mathcal{S} .$$

Let also $\xi = (\xi_1, \ldots, \xi_l)$ be a sequence of $l$ examples drawn
independently from $\mathcal{S}$ according to $P(\xi)$. For any
function $a \in \mathcal{A}$ we define the empirical and true expectations
of $a$ as follows:

$$\hat{E}[a] = \frac{1}{l} \sum_{i=1}^{l} a(\xi_i)$$

$$E[a] = \int_{\mathcal{S}} d\xi\, P(\xi)\, a(\xi)$$

The difference between the empirical and true expectation
can be bounded by the following inequality, whose
proof can be found in [69] and [42], that will be crucial
in order to prove our main theorem.

Claim D.1 ([69], [42]) Let $\mathcal{A}$ and $\xi$ be as defined
above. Then, for all $\epsilon > 0$,

$$P\left(\exists a \in \mathcal{A} : |\hat{E}[a] - E[a]| > \epsilon\right) \le 4 E\left[\mathcal{N}(\epsilon/16, \mathcal{A}_{\xi}, d_{L^1})\right] e^{-\frac{1}{128 U^4} \epsilon^2 l}$$

In the above result, $\mathcal{A}_{\xi}$ is the restriction of $\mathcal{A}$ to the data
set, that is:

$$\mathcal{A}_{\xi} = \{ (a(\xi_1), \ldots, a(\xi_l)) : a \in \mathcal{A} \} . \qquad (22)$$

The set $\mathcal{A}_{\xi}$ is a collection of points belonging to the
subset $[0, U]^l$ of the $l$-dimensional euclidean space. Each
function $a$ in $\mathcal{A}$ is represented by a point in $\mathcal{A}_{\xi}$, while
every point in $\mathcal{A}_{\xi}$ represents all the functions that have
the same values at the points $\xi_1, \ldots, \xi_l$. The distance
metric $d_{L^1}$ in the inequality above is the standard $L^1$
metric in $R^l$, that is

$$d_{L^1}(x, y) = \frac{1}{l} \sum_{\mu=1}^{l} |x^{\mu} - y^{\mu}|$$

where $x$ and $y$ are points in the $l$-dimensional euclidean
space and $x^{\mu}$ and $y^{\mu}$ are their $\mu$-th components
respectively.

The above inequality is a result in the theory of uni- 
form convergence of empirical measures to their under- 
lying probabilities, that has been studied in great detail 
by Pollard and Vapnik, and similar inequalities can be 
found in the work of Vapnik [81, 82, 80], although they 
usually involve the VC dimension of the set A, rather 
than its covering numbers. 

Suppose now we choose $\mathcal{S} = X \times Y$, where $X$ is an
arbitrary subset of $R^k$ and $Y = [-M, M]$ as in the
formulation of our original problem. The generic element
of $\mathcal{S}$ will be written as $\xi = (x, y) \in X \times Y$. We now
consider the class of functions $\mathcal{A}$ defined as:

$$\mathcal{A} = \{ a : X \times Y \to R \mid a(x, y) = (y - h(x))^2, \ h \in H_n(R^k) \}$$

where $H_n(R^k)$ is the class of $k$-dimensional Radial Basis
Functions with $n$ basis functions defined in eq. 12 in
section 3. Clearly,

$$|y - h(x)| \le |y| + |h(x)| \le M + MV ,$$

and therefore

$$0 \le a \le U^2$$

where we have defined

$$U \equiv M + MV .$$

We notice that, by definition of $\hat{E}(a)$ and $E(a)$ we have

$$\hat{E}(a) = \frac{1}{l} \sum_{i=1}^{l} (y_i - h(x_i))^2 = I_{emp}[h]$$

$$E(a) = \int_{X \times Y} dx\, dy\, P(x, y)\,(y - h(x))^2 = I[h]$$

Therefore, applying the inequality of claim D.1 to the
set $\mathcal{A}$, and noticing that the elements of $\mathcal{A}$ are essentially
defined by the elements of $H_n$, we obtain the following
result:

$$P\left(\forall h \in H_n, \ |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - 4 E\left[\mathcal{N}(\epsilon/16, \mathcal{A}_{\xi}, d_{L^1})\right] e^{-\frac{1}{128 U^4} \epsilon^2 l} \qquad (23)$$

so that the inequality of claim D.1 gives us a bound on
the estimation error. However, this bound depends on
the specific choice of the probability distribution $P(x, y)$,
while we are interested in bounds that do not depend on
$P$. Therefore it is useful to define some quantity that
does not depend on $P$, and give bounds in terms of that.
We then introduce the concept of metric capacity
of $\mathcal{A}$, that is defined as

$$\mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) = \sup_{P} \{ \mathcal{N}(\epsilon, \mathcal{A}, d_{L^1(P)}) \}$$

where the supremum is taken over all the probability
distributions $P$ defined over $\mathcal{S}$, and $d_{L^1(P)}$ is the standard
$L^1(P)$ distance$^9$ induced by the probability distribution $P$:

$$d_{L^1(P)}(a_1, a_2) = \int_{\mathcal{S}} d\xi\, P(\xi)\, |a_1(\xi) - a_2(\xi)| , \quad a_1, a_2 \in \mathcal{A} .$$

The relationship between the covering number and the
metric capacity is shown in the following

Claim D.2

$$E\left[\mathcal{N}(\epsilon, \mathcal{A}_{\xi}, d_{L^1})\right] \le \mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) .$$

Proof: For any sequence of points $\xi$ in $\mathcal{S}$, there is a
trivial isometry between $(\mathcal{A}_{\xi}, d_{L^1})$ and $(\mathcal{A}, d_{L^1(P_{\xi})})$ where
$P_{\xi}$ is the empirical distribution on the space $\mathcal{S}$ given
by $\frac{1}{l} \sum_{i=1}^{l} \delta(\xi - \xi_i)$. Here $\delta$ is the Dirac delta
function, $\xi \in \mathcal{S}$, and $\xi_i$ is the $i$-th element of the data
set. To see that this isometry exists, first note that
for every element $a \in \mathcal{A}$, there exists a unique point
$(a(\xi_1), \ldots, a(\xi_l)) \in \mathcal{A}_{\xi}$. Thus a simple bijective mapping
exists between the two spaces. Now consider any two
elements $g$ and $h$ of $\mathcal{A}$. The distance between them is
given by

$$d_{L^1(P_{\xi})}(g, h) = \int_{\mathcal{S}} |g(\xi) - h(\xi)|\, P_{\xi}(\xi)\, d\xi = \frac{1}{l} \sum_{i=1}^{l} |g(\xi_i) - h(\xi_i)| .$$

This is exactly what the distance between the two points
$(g(\xi_1), \ldots, g(\xi_l))$ and $(h(\xi_1), \ldots, h(\xi_l))$, which are elements
of $\mathcal{A}_{\xi}$, is according to the $d_{L^1}$ distance. Thus there is
a one-to-one correspondence between elements of $\mathcal{A}$ and
$\mathcal{A}_{\xi}$ and the distance between two elements in $\mathcal{A}$ is the
same as the distance between their corresponding points
in $\mathcal{A}_{\xi}$. Given this isometry, for every $\epsilon$-cover in $\mathcal{A}$, there
exists an $\epsilon$-cover of the same size in $\mathcal{A}_{\xi}$, so that

$$\mathcal{N}(\epsilon, \mathcal{A}_{\xi}, d_{L^1}) = \mathcal{N}(\epsilon, \mathcal{A}, d_{L^1(P_{\xi})}) \le \mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) ,$$

and consequently $E[\mathcal{N}(\epsilon, \mathcal{A}_{\xi}, d_{L^1})] \le \mathcal{C}(\epsilon, \mathcal{A}, d_{L^1})$. □

$^9$ Note that here $\mathcal{A}$ is a class of real-valued functions
defined on a general metric space $\mathcal{S}$. If we consider an arbitrary
$\mathcal{A}$ defined on $\mathcal{S}$ and taking values in $R^n$, the $d_{L^1(P)}$ norm is
appropriately adjusted to be

$$d_{L^1(P)}(f, g) = \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{S}} |f_i(x) - g_i(x)|\, P(x)\, dx$$

where $f(x) = (f_1(x), \ldots, f_i(x), \ldots, f_n(x))$ and
$g(x) = (g_1(x), \ldots, g_i(x), \ldots, g_n(x))$ are elements of $\mathcal{A}$ and $P(x)$ is a
probability distribution on $\mathcal{S}$. Thus $d_{L^1}$ and $d_{L^1(P)}$ should
be interpreted according to the context.
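
The isometry used in the proof of Claim D.2 can be made concrete: under the empirical distribution, the $L^1$ distance between two functions is the normalized $L^1$ distance between their restrictions to the sample. The following is a minimal sketch under stated assumptions; the sample and the two functions `g` and `h` are hypothetical, chosen only to resemble squared residuals $(y - h(x))^2$.

```python
def restriction(a, sample):
    """Point of the restricted class associated with a: (a(xi_1), ..., a(xi_l))."""
    return tuple(a(xi) for xi in sample)

def d_l1(u, v):
    """Normalized L1 metric on R^l: (1/l) sum_mu |u_mu - v_mu|."""
    return sum(abs(p - q) for p, q in zip(u, v)) / len(u)

def d_l1_empirical(g, h, sample):
    """L1 distance under the empirical distribution, which puts mass 1/l on
    each sample point, so the integral of |g - h| reduces to a finite average."""
    mass = 1.0 / len(sample)
    return sum(abs(g(xi) - h(xi)) * mass for xi in sample)

# Hypothetical sample of (x, y) pairs and two squared-residual functions.
sample = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.8), (1.5, 1.7)]
g = lambda xi: (xi[1] - 0.5 * xi[0]) ** 2
h = lambda xi: (xi[1] - xi[0]) ** 2

lhs = d_l1_empirical(g, h, sample)
rhs = d_l1(restriction(g, sample), restriction(h, sample))
```

The two quantities `lhs` and `rhs` coincide up to floating-point rounding, which is why a cover of the function class under the empirical distribution yields a cover of the same size for the restricted set.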

The result above, together with eq. (23), shows that the
following proposition holds:

Claim D.3

$$P\left(\forall h \in H_n, \ |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - 4\, \mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1})\, e^{-\frac{1}{128 U^4} \epsilon^2 l} \qquad (24)$$

Thus in order to obtain a uniform bound $\omega$ on
$|I_{emp}[h] - I[h]|$, our task is reduced to computing the metric
capacity of the functional class $\mathcal{A}$ which we have just defined.
We will do this in several steps. In Claim D.4, we first
relate the metric capacity of $\mathcal{A}$ to that of the class of
radial basis functions $H_n$. Then Claims D.5 through D.9
go through a computation of the metric capacity of $H_n$.

Claim D.4

$$\mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) \le \mathcal{C}(\epsilon/4U, H_n, d_{L^1})$$

Proof: Fix a distribution $P$ on $\mathcal{S} = X \times Y$. Let $P_X$
be the marginal distribution with respect to $X$. Suppose
$K$ is an $\epsilon/4U$-cover for $H_n$ with respect to this
probability distribution $P_X$, i.e. with respect to the distance
metric $d_{L^1(P_X)}$ on $H_n$. Further let the size of $K$ be
$\mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)})$. This means that for any $h \in H_n$,
there exists a function $h^*$ belonging to $K$, such that:

$$\int |h(x) - h^*(x)|\, P_X(x)\, dx \le \epsilon/4U$$

Now we claim the set $H(K) = \{ (y - h(x))^2 : h \in K \}$
is an $\epsilon$-cover for $\mathcal{A}$ with respect to the distance metric
$d_{L^1(P)}$. To see this, it is sufficient to show that

$$\int |(y - h(x))^2 - (y - h^*(x))^2|\, P(x, y)\, dx\, dy \le$$
$$\le \int 2\,|2y - h - h^*|\,|h - h^*|\, P(x, y)\, dx\, dy \le$$
$$\le \int 2(2M + 2MV)\,|h - h^*|\, P(x, y)\, dx\, dy \le \epsilon$$

which is clearly true. Now

$$\mathcal{N}(\epsilon, \mathcal{A}, d_{L^1(P)}) \le |H(K)| = \mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)}) \le \mathcal{C}(\epsilon/4U, H_n, d_{L^1})$$

Taking the supremum over all probability distributions,
the result follows. □



So the problem reduces to finding $\mathcal{C}(\epsilon, H_n, d_{L^1})$, i.e. the
metric capacity of the class of appropriately defined Radial
Basis Functions networks with $n$ centers. To do this
we will decompose the class $H_n$ to be the composition of
two classes defined as follows.

Definitions/Notations

$H_I$ is a class of functions defined from the metric space
$(R^k, d_{L^1})$ to the metric space $(R^n, d_{L^1})$. In particular,

$$H_I = \{ g(x) = (G(\|x - t_1\|), G(\|x - t_2\|), \ldots, G(\|x - t_n\|)) \}$$

where $G$ is a Gaussian and the $t_i$ are $k$-dimensional vectors.
Note here that $G$ is the same Gaussian that we have been
using to build our Radial Basis Function network. Thus
$H_I$ is parametrized by the $n$ centers $t_i$ and the variance
of the Gaussian $\sigma^2$, in other words $nk + 1$ parameters in
all.

$H_F$ is a class defined from the metric space
$([0, V]^n, d_{L^1})$ to the metric space $(R, d_{L^1})$. In particular,

$$H_F = \Big\{ h(x) = \beta \cdot x, \ x \in [0, V]^n \ \text{and} \ \sum_{i=1}^{n} |\beta_i| \le M \Big\}$$

where $\beta = (\beta_1, \ldots, \beta_n)$ is an arbitrary $n$-dimensional
vector.

Thus we see that

$$H_n = \{ h_F \circ h_I : h_F \in H_F \ \text{and} \ h_I \in H_I \}$$

where $\circ$ stands for the composition operation, i.e., for
any two functions $f$ and $g$, $f \circ g = f(g(x))$. It should
be pointed out that $H_n$ as defined above is defined from
$R^k$ to $R$.
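
The decomposition of a network into an inner Gaussian map and an outer bounded linear map can be sketched directly in code. This is an illustration under stated assumptions: the specific Gaussian parametrization `V * exp(-r^2 / sigma^2)`, the helper names, and the numerical values are all hypothetical; only the structure (an element of $H_F$ composed with an element of $H_I$, with the coefficient constraint) comes from the text.

```python
import math

def make_h_I(centers, sigma, V=1.0):
    """Inner map h_I: R^k -> [0, V]^n, x -> (G(||x - t_1||), ..., G(||x - t_n||)).
    The Gaussian G(r) = V exp(-r^2 / sigma^2) is an assumed parametrization."""
    def h_I(x):
        out = []
        for t in centers:
            r2 = sum((a - b) ** 2 for a, b in zip(x, t))
            out.append(V * math.exp(-r2 / sigma ** 2))
        return out
    return h_I

def make_h_F(beta, M):
    """Outer map h_F: [0, V]^n -> R, z -> beta . z, with sum_i |beta_i| <= M."""
    if sum(abs(b) for b in beta) > M:
        raise ValueError("coefficients violate sum |beta_i| <= M")
    def h_F(z):
        return sum(b * zi for b, zi in zip(beta, z))
    return h_F

def compose(h_F, h_I):
    """The network h_F o h_I, a map from R^k to R."""
    return lambda x: h_F(h_I(x))

centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # n = 3 centers in R^2
beta, M, V = [0.5, -1.0, 0.25], 2.0, 1.0
network = compose(make_h_F(beta, M), make_h_I(centers, sigma=0.8, V=V))
value = network((0.3, 0.4))
```

Because each Gaussian output lies in $[0, V]$ and the coefficients satisfy $\sum_i |\beta_i| \le M$, the network output is bounded in absolute value by $MV$, which is the bound used later for $H_F$.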



Claim D.5

$$\mathcal{C}(\epsilon, H_I, d_{L^1}) \le 2^n \left[ \frac{2eV}{\epsilon} \ln\left(\frac{2eV}{\epsilon}\right) \right]^{n(k+2)}$$

Proof: Fix a probability distribution $P$ on $R^k$. Consider
the class

$$\mathcal{G} = \{ g : g(x) = G(\|x - t\|), \ t \in R^k \} .$$

Let $K$ be an $\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)})$-sized $\epsilon$-cover for this class.
We first claim that

$$T = \{ (h_1, \ldots, h_n) : h_i \in K \}$$

is an $\epsilon$-cover for $H_I$ with respect to the $d_{L^1(P)}$ metric.

Remember that the $d_{L^1(P)}$ distance between two
vector-valued functions $g(x) = (g_1(x), \ldots, g_n(x))$ and
$g^*(x) = (g^*_1(x), \ldots, g^*_n(x))$ is defined as

$$d_{L^1(P)}(g, g^*) = \frac{1}{n} \sum_{i=1}^{n} \int |g_i(x) - g^*_i(x)|\, P(x)\, dx$$

To see this, pick an arbitrary $g = (g_1, \ldots, g_n) \in H_I$.
For each $g_i$, there exists a $g^*_i \in K$ which is $\epsilon$-close
in the appropriate sense for real-valued functions, i.e.
$d_{L^1(P)}(g_i, g^*_i) \le \epsilon$. The function $g^* = (g^*_1, \ldots, g^*_n)$ is an
element of $T$. Also, the distance between $(g_1, \ldots, g_n)$ and
$(g^*_1, \ldots, g^*_n)$ in the $d_{L^1(P)}$ metric is

$$d_{L^1(P)}(g, g^*) \le \frac{1}{n} \sum_{i=1}^{n} \epsilon = \epsilon .$$

Thus we obtain that

$$\mathcal{N}(\epsilon, H_I, d_{L^1(P)}) \le \left[ \mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)}) \right]^n$$

and taking the supremum over all probability distributions
as usual, we get

$$\mathcal{C}(\epsilon, H_I, d_{L^1}) \le \left( \mathcal{C}(\epsilon, \mathcal{G}, d_{L^1}) \right)^n .$$

Now we need to find the capacity of $\mathcal{G}$. This is done in
Claim D.6. From this the result follows. □

Definitions/Notations

Before we proceed to the next step in our proof, some
more notation needs to be defined. Let $\mathcal{A}$ be a family
of functions from a set $\mathcal{S}$ into $R$. For any sequence
$\xi = (\xi_1, \ldots, \xi_d)$ of points in $\mathcal{S}$, let $\mathcal{A}_{\xi}$ be the restriction
of $\mathcal{A}$ to the data set, as per our previously introduced
notation. Thus $\mathcal{A}_{\xi} = \{ (a(\xi_1), \ldots, a(\xi_d)) : a \in \mathcal{A} \}$. If
there exists some translation of the set $\mathcal{A}_{\xi}$, such that
it intersects all $2^d$ orthants of the space $R^d$, then $\xi$ is
said to be shattered by $\mathcal{A}$. Expressing this a little more
formally, let $B$ be the set of all possible $d$-dimensional
boolean vectors. If there exists a translation $t \in R^d$
such that for every $b \in B$, there exists some function
$a_b \in \mathcal{A}$ satisfying $a_b(\xi_i) - t_i \ge 0 \Leftrightarrow b_i = 1$ for all $i = 1$
to $d$, then the set $(\xi_1, \ldots, \xi_d)$ is shattered by $\mathcal{A}$. Note that
the inequality could easily have been defined to be strict
and would not have made a difference. The largest $d$
such that there exists a sequence of $d$ points which are
shattered by $\mathcal{A}$ is said to be the pseudo-dimension of $\mathcal{A}$,
denoted by pdim $\mathcal{A}$. □

In this context, there are two important theorems which
we will need to use. We give these theorems without
proof.

Theorem D.1 (Dudley) Let $F$ be a $k$-dimensional
vector space of functions from a set $\mathcal{S}$ into $R$. Then
pdim$(F) = k$.

The following theorem is stated and proved in a somewhat
more general form by Pollard. Haussler, using
techniques from Pollard, has proved the specific form shown
here.

Theorem D.2 (Pollard, Haussler) Let $F$ be a family
of functions from a set $\mathcal{S}$ into $[M_1, M_2]$, where
pdim$(F) = d$ for some $1 \le d < \infty$. Let $P$ be a probability
distribution on $\mathcal{S}$. Then for all $0 < \epsilon \le M_2 - M_1$,

$$\mathcal{M}(\epsilon, F, d_{L^1(P)}) < 2 \left( \frac{2e(M_2 - M_1)}{\epsilon} \log \frac{2e(M_2 - M_1)}{\epsilon} \right)^d$$

Here $\mathcal{M}(\epsilon, F, d_{L^1(P)})$ is the packing number of $F$ according
to the distance metric $d_{L^1(P)}$.

Claim D.6

$$\mathcal{C}(\epsilon, \mathcal{G}, d_{L^1}) \le 2 \left[ \frac{2eV}{\epsilon} \ln\left(\frac{2eV}{\epsilon}\right) \right]^{(k+2)}$$



Proof: Consider the $(k + 2)$-dimensional vector space of
functions from $R^k$ to $R$ defined as

$$\mathcal{G}_1 \equiv \mathrm{span}\{ 1, x^1, x^2, \ldots, x^k, \|x\|^2 \}$$

where $x \in R^k$ and $x^{\mu}$ is the $\mu$-th component of the vector
$x$. Now consider the class

$$\mathcal{G}_2 = \{ \alpha e^{-f} : f \in \mathcal{G}_1 \}$$

where $\alpha$ is a positive constant fixed by the standard
deviation of the Gaussian, and $f \in \mathcal{G}_1$. We claim that the
pseudo-dimension of $\mathcal{G}$, denoted by pdim$(\mathcal{G})$, fulfills the
following inequality,

$$\mathrm{pdim}(\mathcal{G}) \le \mathrm{pdim}(\mathcal{G}_2) = \mathrm{pdim}(\mathcal{G}_1) = (k + 2) .$$

To see this consider the fact that $\mathcal{G} \subseteq \mathcal{G}_2$. Consequently,
for every sequence of points $x = (x_1, \ldots, x_d)$,
$\mathcal{G}_x \subseteq (\mathcal{G}_2)_x$. Thus if $(x_1, \ldots, x_d)$ is shattered by $\mathcal{G}$, it
will be shattered by $\mathcal{G}_2$. This establishes the first
inequality.

We now show that pdim$(\mathcal{G}_2) \le$ pdim$(\mathcal{G}_1)$. It is
enough to show that every set shattered by $\mathcal{G}_2$ is
also shattered by $\mathcal{G}_1$. Suppose there exists a sequence
$(x_1, x_2, \ldots, x_d)$ which is shattered by $\mathcal{G}_2$. This means
that by our definition of shattering, there exists a
translation $t \in R^d$ such that for every boolean vector
$b \in \{0, 1\}^d$ there is some function $g_b = \alpha e^{-f_b}$,
where $f_b \in \mathcal{G}_1$, satisfying $g_b(x_i) \ge t_i$ if and only
if $b_i = 1$, where $t_i$ and $b_i$ are the $i$-th components
of $t$ and $b$ respectively. First notice that every function
in $\mathcal{G}_2$ is positive. Consequently, we see that every
$t_i$ has to be greater than 0, for otherwise $g_b(x_i)$
could never be less than $t_i$, which it is required to be
if $b_i = 0$. Having established that every $t_i$ is greater
than 0, we now show that the set $(x_1, x_2, \ldots, x_d)$ is
shattered by $\mathcal{G}_1$. We let the translation in this case be
$t' = (\log(t_1/\alpha), \log(t_2/\alpha), \ldots, \log(t_d/\alpha))$. We can take
the log since the $t_i/\alpha$'s are greater than 0. Now for every
boolean vector $b$, we take the function $-f_b \in \mathcal{G}_1$ and
we see that since

$$g_b(x_i) = \alpha e^{-f_b(x_i)} \ge t_i \Leftrightarrow b_i = 1 ,$$

it follows that

$$-f_b(x_i) \ge \log(t_i/\alpha) = t'_i \Leftrightarrow b_i = 1 .$$

Thus we see that the set $(x_1, x_2, \ldots, x_d)$ can be shattered
by $\mathcal{G}_1$. By a similar argument, it is also possible to show
that pdim$(\mathcal{G}_1) \le$ pdim$(\mathcal{G}_2)$, so that the two
pseudo-dimensions are equal.

Since $\mathcal{G}_1$ is a vector space of dimensionality $k + 2$, an
application of Dudley's Theorem [24] yields the value
$k + 2$ for its pseudo-dimension. Further, functions in
the class $\mathcal{G}$ are in the range $[0, V]$. Now we see (by an
application of Pollard's theorem) that

$$\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)}) \le \mathcal{M}(\epsilon, \mathcal{G}, d_{L^1(P)}) \le$$
$$\le 2 \left( \frac{2eV}{\epsilon} \ln\left(\frac{2eV}{\epsilon}\right) \right)^{\mathrm{pdim}(\mathcal{G})} \le$$
$$\le 2 \left( \frac{2eV}{\epsilon} \ln\left(\frac{2eV}{\epsilon}\right) \right)^{(k+2)}$$

Taking the supremum over all probability distributions,
the result follows. □
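
To get a feeling for the size of these capacity bounds, they can be evaluated numerically. The sketch below works with logarithms to avoid overflow; the function names are illustrative, and the code simply evaluates the expressions of Theorem D.2, Claim D.6 and Claim D.5 as stated above, with the range width, pseudo-dimension and $\epsilon$ passed in as assumed inputs.

```python
import math

def log_packing_bound(eps, width, pdim):
    """log of 2 * ((2e * width / eps) * ln(2e * width / eps)) ** pdim,
    the bound of Theorem D.2 for functions with range of length `width`."""
    if not 0 < eps <= width:
        raise ValueError("the bound is stated for 0 < eps <= width")
    r = 2 * math.e * width / eps
    return math.log(2) + pdim * math.log(r * math.log(r))

def log_capacity_gaussians(eps, V, k):
    """Claim D.6: one Gaussian unit in k variables, pseudo-dimension <= k + 2."""
    return log_packing_bound(eps, V, k + 2)

def log_capacity_H_I(eps, V, k, n):
    """Claim D.5: n Gaussian units, i.e. the n-th power of the Claim D.6 bound."""
    return n * log_capacity_gaussians(eps, V, k)
```

For example, `log_capacity_H_I(0.1, 1.0, 3, 5)` is five times `log_capacity_gaussians(0.1, 1.0, 3)`, reflecting the product structure of the cover built in the proof of Claim D.5, and both bounds grow as `eps` shrinks.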



Claim D.7

$$\mathcal{C}(\epsilon, H_F, d_{L^1}) \le 2 \left[ \frac{4MeV}{\epsilon} \ln\left(\frac{4MeV}{\epsilon}\right) \right]^{n}$$

Proof: The proof of this runs in very similar fashion.
First note that

$$H_F \subseteq \{ \beta \cdot x : x, \beta \in R^n \} .$$

The latter set is a vector space of dimensionality $n$ and by
Dudley's theorem [24], we see that its pseudo-dimension
pdim is $n$. Also, clearly by the same argument as in the
previous proposition, we have that pdim$(H_F) \le n$. To
get bounds on the functions in $H_F$, notice that

$$\Big| \sum_{i=1}^{n} \beta_i x_i \Big| \le \sum_{i=1}^{n} |\beta_i|\,|x_i| \le V \sum_{i=1}^{n} |\beta_i| \le MV .$$

Thus functions in $H_F$ are bounded in the range
$[-MV, MV]$. Now using Pollard's result [42], [69], we
have that

$$\mathcal{N}(\epsilon, H_F, d_{L^1(P)}) \le \mathcal{M}(\epsilon, H_F, d_{L^1(P)}) \le 2 \left( \frac{4MeV}{\epsilon} \ln\left(\frac{4MeV}{\epsilon}\right) \right)^{n}$$

Taking supremums over all probability distributions, the
result follows. □

Claim D.8 A uniform first-order Lipschitz bound of
$H_F$ is $Mn$.

Proof: Suppose we have $x, y \in R^n$ such that

$$d_{L^1}(x, y) \le \epsilon .$$

The quantity $Mn$ is a uniform first-order Lipschitz
bound for $H_F$ if, for any element of $H_F$, parametrized
by a vector $\beta$, the following inequality holds:

$$|x \cdot \beta - y \cdot \beta| \le Mn\epsilon$$

Now clearly,

$$|x \cdot \beta - y \cdot \beta| = \Big| \sum_{i=1}^{n} \beta_i (x_i - y_i) \Big| \le \sum_{i=1}^{n} |\beta_i|\,|x_i - y_i| \le M \sum_{i=1}^{n} |x_i - y_i| \le Mn\epsilon$$

The result is proved. □

Claim D.9

$$\mathcal{C}(\epsilon, H_n, d_{L^1}) \le \mathcal{C}\left(\frac{\epsilon}{2Mn}, H_I, d_{L^1}\right) \mathcal{C}\left(\frac{\epsilon}{2}, H_F, d_{L^1}\right)$$

Proof: Fix a distribution $P$ on $R^k$. Assume we have
an $\epsilon/(2Mn)$-cover for $H_I$ with respect to the probability
distribution $P$ and metric $d_{L^1(P)}$. Let it be $K$ where

$$|K| = \mathcal{N}(\epsilon/2Mn, H_I, d_{L^1(P)}) .$$

Now each function $f \in K$ maps the space $R^k$ into $R^n$,
thus inducing a probability distribution $P_f$ on the space
$R^n$. Specifically, $P_f$ can be defined as the distribution
obtained from the measure $\mu_f$ defined so that any
measurable set $A \subseteq R^n$ will have measure

$$\mu_f(A) = \int_{f^{-1}(A)} P(x)\, dx .$$

Further, there exists a cover $K_f$ which is an $\epsilon/2$-cover
for $H_F$ with respect to the probability distribution $P_f$.
In other words

$$|K_f| = \mathcal{N}(\epsilon/2, H_F, d_{L^1(P_f)}) .$$

We claim that

$$H(K) = \{ f \circ g : g \in K \ \text{and} \ f \in K_g \}$$

is an $\epsilon$-cover for $H_n$. Further we note that



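$$|H(K)| = \sum_{f \in K} |K_f| \le \sum_{f \in K} \mathcal{C}(\epsilon/2, H_F, d_{L^1}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)})\, \mathcal{C}(\epsilon/2, H_F, d_{L^1})$$

To see that $H(K)$ is an $\epsilon$-cover, suppose we are given an
arbitrary function $h_f \circ h_i \in H_n$. There clearly exists a
function $h_i^* \in K$ such that

$$\int_{R^k} d_{L^1}(h_i(x), h_i^*(x))\, P(x)\, dx \le \epsilon/(2Mn)$$

Now there also exists a function $h_f^* \in K_{h_i^*}$ such that

$$\int_{R^k} |h_f \circ h_i^*(x) - h_f^* \circ h_i^*(x)|\, P(x)\, dx = \int_{R^n} |h_f(y) - h_f^*(y)|\, P_{h_i^*}(y)\, dy \le \epsilon/2 .$$

To show that $H(K)$ is an $\epsilon$-cover it is sufficient to show
that

$$\int_{R^k} |h_f \circ h_i(x) - h_f^* \circ h_i^*(x)|\, P(x)\, dx \le \epsilon .$$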

Now

$$\int_{R^k} |h_f \circ h_i(x) - h_f^* \circ h_i^*(x)|\, P(x)\, dx \le$$
$$\le \int_{R^k} \{ |h_f \circ h_i(x) - h_f \circ h_i^*(x)| + |h_f \circ h_i^*(x) - h_f^* \circ h_i^*(x)| \}\, P(x)\, dx$$

by the triangle inequality. Further, since $h_f$ is Lipschitz
bounded,

$$\int_{R^k} |h_f \circ h_i(x) - h_f \circ h_i^*(x)|\, P(x)\, dx \le \int_{R^k} Mn\, d_{L^1}(h_i(x), h_i^*(x))\, P(x)\, dx \le Mn\,(\epsilon/2Mn) \le \epsilon/2$$

Also,

$$\int_{R^k} |h_f \circ h_i^*(x) - h_f^* \circ h_i^*(x)|\, P(x)\, dx = \int_{R^n} |h_f(y) - h_f^*(y)|\, P_{h_i^*}(y)\, dy \le \epsilon/2$$

Consequently both sums are less than $\epsilon/2$ and the total
integral is less than $\epsilon$. Now we see that

$$\mathcal{N}(\epsilon, H_n, d_{L^1(P)}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)})\, \mathcal{C}(\epsilon/2, H_F, d_{L^1})$$

Taking supremums over all probability distributions, the
result follows. □

Having obtained the crucial bound on the metric capacity
of the class $H_n$, we can now prove the following

Claim D.10 With probability $1 - \delta$, and $\forall h \in H_n$, the
following bound holds:

$$|I_{emp}[h] - I[h]| \le O\left( \left[ \frac{nk \ln(nl) + \ln(1/\delta)}{l} \right]^{1/2} \right)$$

Proof: We know from the previous claim that

$$\mathcal{C}(\epsilon, H_n, d_{L^1}) \le 2^{n+1} \left[ \frac{4MeVn}{\epsilon} \ln\left(\frac{4MeVn}{\epsilon}\right) \right]^{n(k+2)} \left[ \frac{8MeV}{\epsilon} \ln\left(\frac{8MeV}{\epsilon}\right) \right]^{n} \le$$
$$\le \left[ \frac{8MeVn}{\epsilon} \ln\left(\frac{8MeVn}{\epsilon}\right) \right]^{n(k+3)}$$

From claim (D.3), we see that

$$P\left(\forall h \in H_n, \ |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - \delta$$

as long as

$$\mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1})\, e^{-\frac{1}{128 U^4} \epsilon^2 l} \le \frac{\delta}{4}$$

which in turn is satisfied as long as (by Claim D.4)

$$\mathcal{C}(\epsilon/64U, H_n, d_{L^1})\, e^{-\frac{1}{128 U^4} \epsilon^2 l} \le \frac{\delta}{4}$$

which implies

$$\left( \frac{256 MeVUn}{\epsilon} \ln\left(\frac{256 MeVUn}{\epsilon}\right) \right)^{n(k+3)} e^{-\frac{1}{128 U^4} \epsilon^2 l} \le \frac{\delta}{4} \qquad (25)$$

In other words,

$$\left( \frac{An}{\epsilon} \ln\left(\frac{An}{\epsilon}\right) \right)^{n(k+3)} e^{-\epsilon^2 l / B} \le \frac{\delta}{4}$$

for constants $A$, $B$. The latter inequality is satisfied as
long as

$$(An/\epsilon)^{2n(k+3)}\, e^{-\epsilon^2 l / B} \le \frac{\delta}{4}$$

which implies

$$2n(k+3)(\ln(An) - \ln(\epsilon)) - \epsilon^2 l / B \le \ln(\delta/4)$$

and in turn implies

$$\epsilon^2 l \ge B \ln(4/\delta) + 2Bn(k+3)(\ln(An) - \ln(\epsilon)) .$$

We now show that the above inequality is satisfied for

$$\epsilon = \left( \frac{B \left[ \ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l) \right]}{l} \right)^{1/2}$$

Putting the above value of $\epsilon$ in the inequality of interest
we get

$$\epsilon^2 (l/B) = \ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l) \ge$$
$$\ge \ln(4/\delta) + 2n(k+3)\ln(An) + 2n(k+3)\, \frac{1}{2} \ln\left( \frac{l}{B \left[ \ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l) \right]} \right)$$

In other words,

$$n(k+3)\ln(l) \ge n(k+3) \ln\left( \frac{l}{B \left[ \ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l) \right]} \right)$$

Since

$$B \left[ \ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l) \right] \ge 1$$

the inequality is obviously true for this value of $\epsilon$. Taking
this value of $\epsilon$ then proves our claim. □

D.3 Bounding the generalization error

Finally we are able to take our results in Parts II and III
to prove our main result:

Theorem D.3 With probability greater than $1 - \delta$ the
following inequality is valid:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le O\left(\frac{1}{n}\right) + O\left( \left[ \frac{nk \ln(nl) - \ln \delta}{l} \right]^{1/2} \right)$$

Proof: We have seen in statement (2.1) that the
generalization error is bounded as follows:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le \epsilon(n) + 2\omega(l, n, \delta) .$$

In section (D.1) we showed that

$$\epsilon(n) = O\left(\frac{1}{n}\right)$$

and in claim (D.10) we showed that

$$\omega(l, n, \delta) = O\left( \left[ \frac{nk \ln(nl) - \ln \delta}{l} \right]^{1/2} \right)$$

Therefore the theorem is proved putting these results
together. □
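
The two terms of the final bound pull in opposite directions as $n$ grows, which can be seen by evaluating its shape. The sketch below is only an illustration: the constants hidden in the $O(\cdot)$ terms are unknown and are set to 1 here as an explicit assumption, so the numbers it produces are not predictions, only an indication of how the minimizing $n$ moves with the sample size $l$.

```python
import math

def approximation_term(n):
    """O(1/n) term, with the unknown constant set to 1 (an assumption)."""
    return 1.0 / n

def estimation_term(n, l, k, delta):
    """O([(n k ln(n l) - ln(delta)) / l]^(1/2)) term, unit constant assumed."""
    return math.sqrt((n * k * math.log(n * l) - math.log(delta)) / l)

def bound(n, l, k, delta):
    """Sum of the two terms for n basis functions, l examples, k variables."""
    return approximation_term(n) + estimation_term(n, l, k, delta)

def best_n(l, k, delta, n_max=200):
    """Number of basis functions minimizing the bound for a given sample size l."""
    return min(range(1, n_max + 1), key=lambda n: bound(n, l, k, delta))
```

Under these assumed unit constants, `best_n` increases with `l`: more data allows a larger network before the estimation term dominates, while for fixed `n` the bound decreases as `l` grows.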
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